Every LLM has a token limit. That limit determines how much text, code, and conversation history the model can process in a single request. This page is the definitive reference for every major model's context window, max output tokens, pricing, and the hidden surcharges most comparison tables leave out.
Every LLM's Token Limit (February 2026)
The table below covers every major LLM available via API as of February 2026. Context window is the total token capacity of a single request, input plus output. Max output is the longest response the model can generate in a single call. Pricing is per million tokens.
| Model | Context Window | Max Output | Input $/M | Output $/M |
|---|---|---|---|---|
| GPT-5.2 (OpenAI) | 400K | 128K | $1.75 | $14.00 |
| GPT-5 (OpenAI) | 400K | 128K | $1.25 | $10.00 |
| GPT-5 nano (OpenAI) | 400K | 128K | $0.05 | $0.40 |
| o3 (OpenAI) | 200K | 100K | $0.40 | $1.60 |
| Claude Opus 4.6 (Anthropic) | 200K (1M beta) | 64K | $5.00 | $25.00 |
| Claude Sonnet 4.6 (Anthropic) | 200K (1M beta) | 64K | $3.00 | $15.00 |
| Gemini 2.5 Pro (Google) | 1M | 64K | $1.25 | $10.00 |
| Gemini 3 Pro (Google) | 2M | - | - | - |
| Grok 3 (xAI) | 131K | - | $3.00 | $15.00 |
| Llama 4 Scout (Meta) | 10M | - | Free | - |
| Llama 4 Maverick (Meta) | 1M | - | Free | - |
| DeepSeek R1 (DeepSeek) | 128K | 64K | $0.55 | $2.19 |
| Codestral (Mistral) | 256K | - | $0.30 | $0.90 |
Context window != usable context
A model advertising 200K tokens does not mean it performs well at 200K tokens. Research consistently shows performance degradation well before the stated limit. Models claiming 200K context degrade noticeably around 130K tokens. The stated window is a ceiling, not a performance guarantee.
A few things stand out. OpenAI has the most consistent max output across models (128K). Anthropic and Google offer the largest context windows from commercial providers, but both apply significant pricing surcharges above 200K tokens. Meta's Llama 4 models have the largest raw context windows (1M-10M) but are primarily for self-hosted deployments. DeepSeek R1 and Codestral offer the lowest per-token pricing for API access.
Context Window vs. Max Output: Why Both Matter
The context window is the total number of tokens the model can process in a single request, including both your input and its output. The max output limit caps how long the model's response can be. These are separate constraints, and both can block you.
GPT-5 has a 400K context window but a 128K max output. If you send 380K tokens of input, the model can only generate a 20K-token response (400K - 380K). If you need a long output, you must leave room in the context window for it. Claude Opus 4.6 has a 200K context window with a 64K max output, so once your input exceeds 136K tokens (200K - 64K), the remaining window, not the 64K cap, is what limits the response length.
For coding agents and chat applications, the max output limit rarely matters since responses are typically under 4K tokens. But for code generation, document writing, and batch processing tasks, max output becomes the binding constraint. Plan your input budget accordingly.
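The budget arithmetic is simple enough to encode as a guard. A minimal sketch using GPT-5's published limits from the table above (the helper name is ours, not part of any SDK):

```python
CONTEXT_WINDOW = 400_000   # GPT-5 total window (input + output)
MAX_OUTPUT = 128_000       # GPT-5 max output cap

def available_output(input_tokens: int) -> int:
    """Output room is the smaller of the model's output cap and
    whatever the context window has left after the input."""
    room_in_window = CONTEXT_WINDOW - input_tokens
    return max(0, min(MAX_OUTPUT, room_in_window))

print(available_output(380_000))  # 20000 — window is the binding constraint
print(available_output(100_000))  # 128000 — output cap is the binding constraint
```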
Open-Weight Models: Context at Scale
Meta's Llama 4 Scout supports 10M tokens, the largest context window of any model in this comparison. But there's a catch: you need the infrastructure to run it. Open-weight models with huge context windows require significant GPU memory. The 10M token window is a theoretical maximum that depends on your hardware provisioning.
For self-hosted deployments, the practical context limit is determined by your available GPU memory, not the model's architecture. A model that supports 10M tokens on paper might only handle 500K on your specific hardware configuration. Always benchmark with your actual deployment before committing to a context window size.
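To see why, consider the KV cache, which grows linearly with context length. The sketch below uses hypothetical architecture numbers (48 layers, 8 KV heads, 128-dim heads, fp16), not any published Llama 4 configuration, and ignores the memory needed for weights and activations:

```python
def kv_cache_gb(tokens, layers=48, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV-cache size for one sequence: two tensors (K and V)
    per layer, each holding kv_heads * head_dim values per token."""
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 1e9

print(f"{kv_cache_gb(500_000):.0f} GB")     # ~98 GB for a 500K-token sequence
print(f"{kv_cache_gb(10_000_000):.0f} GB")  # ~1966 GB at the 10M maximum
```

Under these assumptions, even the cache for a half-million-token sequence exceeds a single 80 GB GPU, which is why the paper limit and the deployed limit diverge.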
How Tokenization Works
Tokens are not words. A token is a chunk of text that the model processes as a single unit. All major LLMs use some variant of Byte Pair Encoding (BPE), an algorithm that builds a vocabulary of common character sequences from a training corpus. Common words become single tokens. Rare words get split into multiple tokens.
| Provider | Tokenizer | Vocab Size | Notes |
|---|---|---|---|
| OpenAI (GPT-5) | o200k_base | 200K | Most efficient for English. Successor to cl100k_base |
| Anthropic (Claude) | Proprietary BPE | - | Optimized for code and multilingual text |
| Meta (Llama 4) | SentencePiece BPE | 128K | Open-source tokenizer with broad language coverage |
| Google (Gemini) | SentencePiece | 256K | Large vocab for multilingual efficiency |
| Mistral | SentencePiece BPE | 32K | Compact vocabulary, fast tokenization |
Token-to-Text Ratios
The relationship between tokens and human-readable text varies by language and content type. Rough ratios, consistent with the tokenizer examples in this section:

| Content Type | Approximate Ratio |
|---|---|
| English prose | ~1.3 tokens per word |
| Code | ~1.7 tokens per word |
| CJK text | 2-8x the tokens of equivalent English |

These ratios determine how much actual content fits within a given token limit.
In practice, this means GPT-5's 400K token window holds roughly 280K English words or about 560 pages of text. But the same 400K tokens holds significantly less code, and dramatically less CJK text. If your application handles multilingual input, the effective context window is much smaller than the headline number suggests.
Token counting example (Python)
```python
import tiktoken

# GPT-5 uses o200k_base
enc = tiktoken.get_encoding("o200k_base")

english = "The quick brown fox jumps over the lazy dog"
code = "function handleAuth(req: Request): Promise<Response> {"
chinese = "快速的棕色狐狸跳过了懒狗"

print(f"English: {len(enc.encode(english))} tokens")  # ~9 tokens
print(f"Code: {len(enc.encode(code))} tokens")        # ~12 tokens
print(f"Chinese: {len(enc.encode(chinese))} tokens")  # ~11 tokens

# Same semantic content, very different token counts
# English: ~1.3 tokens/word
# Code: ~1.7 tokens/word
# Chinese: ~2.2 tokens/character
```

GPT-5's tokenizer is more efficient
GPT-5's o200k_base tokenizer has twice the vocabulary of GPT-4's cl100k_base. Larger vocabularies mean more common sequences get single-token representations, reducing total token count for the same text. If you're counting tokens with the old tokenizer, your estimates will be too high.
How BPE Tokenization Works
Byte Pair Encoding starts with individual bytes and iteratively merges the most frequent adjacent pairs. After training on a large corpus, the tokenizer has a fixed vocabulary of subword units. Common English words like "the" and "function" become single tokens. Rare words get split: "tokenization" might become "token" + "ization", and a rare proper noun might be split into individual characters.
This is why token counts are unpredictable without actually running the tokenizer. A 10-word sentence might be 10 tokens if every word is common, or 25 tokens if it contains technical jargon, URLs, or code. Whitespace and punctuation also consume tokens. A JSON object with many brackets, colons, and quotes uses more tokens than the equivalent plain text.
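The merge mechanic can be shown with a toy version of BPE training. Real tokenizers operate on bytes with pre-trained merge tables and word-boundary rules; this only illustrates the loop:

```python
from collections import Counter

def bpe_merge_once(tokens: list[str]) -> list[str]:
    """Merge every occurrence of the single most frequent adjacent pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

seq = list("low lower lowest")
for _ in range(4):
    seq = bpe_merge_once(seq)
print(seq)  # ['low', ' lowe', 'r', ' lowe', 's', 't']
```

After four merges the frequent stem survives as whole tokens while the rare suffixes stay split, which is exactly the behavior described above.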
Practical Implications for Token Budgeting
When building applications against token-limited APIs, rough estimates break down in edge cases. Here are the patterns that consume more tokens than expected:
- URLs and file paths: A URL like `https://api.example.com/v2/users/12345` can consume 15+ tokens due to slashes, dots, and numbers being separate tokens.
- JSON and structured data: Brackets, quotes, colons, and commas each consume tokens. A compact JSON object uses roughly 2x the tokens of equivalent plain text.
- Base64 and encoded strings: Encoded binary data tokenizes very poorly. A base64 image embedded in context can consume 10-50x more tokens than describing what the image contains.
- Stack traces and error logs: Repetitive paths and line numbers inflate token counts. A single Java stack trace can easily consume 500+ tokens.
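If you want a fast pre-flight guess without loading a tokenizer, a crude heuristic along these lines can at least flag symbol-heavy inputs. The character-class penalty and 4-chars-per-token figure are rough assumptions of ours, not measured constants; use the real tokenizer for anything billing-critical:

```python
import re

def rough_token_estimate(text: str) -> int:
    """Heuristic: ~4 characters per token for prose, plus a penalty for
    structural symbols, which tend to tokenize one per character."""
    words = len(text.split())
    symbols = len(re.findall(r'[{}\[\]():"/,.=<>]', text))
    return max(words, round(len(text) / 4)) + symbols

print(rough_token_estimate("The quick brown fox jumps over the lazy dog"))  # 11
print(rough_token_estimate('{"id": 12345, "name": "Ada"}'))                 # 18
```

Note how the JSON snippet, despite being shorter, scores higher than the prose sentence because of its bracket and quote density.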
What Happens When You Hit the Limit
No major LLM provider silently truncates your input. If your request exceeds the token limit, you get an error. But the error handling and available workarounds differ by provider.
Provider Error Behavior
| Provider | Error Type | Details Provided | Automatic Truncation? |
|---|---|---|---|
| OpenAI | HTTP 400 | Exact token count in error message | No |
| Anthropic | Validation error | Token details in response | No |
| Google | 400 Bad Request | Token count exceeded message | Optional (countTokens API) |
Typical OpenAI token limit error
```json
{
  "error": {
    "message": "This model's maximum context length is 400000 tokens. However, your messages resulted in 412847 tokens. Please reduce the length of the messages.",
    "type": "invalid_request_error",
    "code": "context_length_exceeded"
  }
}
```

Truncation Strategies
When your input exceeds the limit, you need to cut it down. Three common strategies, each with different tradeoffs:
| Strategy | How It Works | Best For | Risk |
|---|---|---|---|
| Stop at limit | Drop everything past the token ceiling | Simple batch processing | Loses recent context (often most relevant) |
| Truncate middle | Keep start + end, remove the middle | Conversations with important system prompt and recent messages | Loses middle context (citations, details) |
| Rolling window | Drop oldest messages, keep most recent | Chat applications, ongoing sessions | Loses early context (setup, instructions) |
All three strategies are lossy. You throw away information and hope the model doesn't need it. Context compression is the alternative: reduce token count while preserving the information content.
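The rolling-window strategy from the table is only a few lines in practice. A sketch that pins the system prompt and drops the oldest turns first (`count_tokens` is a stand-in for a real tokenizer call):

```python
def rolling_window(messages, budget, count_tokens):
    """Keep the system prompt plus as many of the most recent turns as fit."""
    system, turns = messages[0], messages[1:]
    kept, used = [], count_tokens(system["content"])
    for msg in reversed(turns):  # walk newest to oldest
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))

# Toy tokenizer: one token per word
msgs = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
]
trimmed = rolling_window(msgs, budget=6, count_tokens=lambda t: len(t.split()))
print([m["content"] for m in trimmed])  # ['be brief', 'four five', 'six']
```

The oldest user turn is dropped first, which is exactly the failure mode noted in the table: setup and instructions disappear before recent chatter does.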
Output Token Limits
A separate but related limit: max output tokens. Even if your input fits within the context window, the model's response has its own ceiling. GPT-5 caps output at 128K tokens. Claude caps at 64K. If the model's response exceeds this limit, it gets truncated mid-sentence with a finish_reason: "length" flag in the response.
Detecting output truncation
```python
# Check if the response was cut short due to output token limit
response = client.chat.completions.create(
    model="gpt-5",
    messages=[...],
    max_tokens=4096  # Set explicit output limit
)

if response.choices[0].finish_reason == "length":
    # Response was truncated: need to continue or reduce scope
    print("Output hit token limit, response is incomplete")
elif response.choices[0].finish_reason == "stop":
    # Response completed naturally
    print("Full response received")
```

For most chat and coding applications, output limits are not the bottleneck since responses are typically 1-4K tokens. But for code generation, document drafting, and data transformation tasks, you can hit the output ceiling before the context window matters. Set max_tokens explicitly to control output length and catch truncation early.
The Hidden Cost: Long-Context Pricing Tiers
Most LLM pricing pages show a single per-token rate. What they bury in the footnotes: Anthropic and Google both apply steep surcharges when your request crosses 200K tokens. And the surcharge applies to the entire request, not just the tokens above the threshold.
| Provider | Standard Input | Over 200K Input | Standard Output | Over 200K Output |
|---|---|---|---|---|
| Anthropic (Sonnet 4.6) | $3.00 | $6.00 (2x) | $15.00 | $22.50 (1.5x) |
| Anthropic (Opus 4.6) | $5.00 | $10.00 (2x) | $25.00 | $37.50 (1.5x) |
| Google (Gemini 2.5 Pro) | $1.25 | $2.50 (2x) | $10.00 | $20.00 (2x) |
| OpenAI (GPT-5) | $1.25 | $1.25 (no surcharge) | $10.00 | $10.00 (no surcharge) |
The surcharge is all-or-nothing
If your Anthropic request is 201K tokens, the 2x input rate applies to all 201K tokens, not just the 1K over the threshold. A request at 199K tokens costs $0.60 (Sonnet). A request at 201K tokens costs $1.21. That's a 2x jump for 2K extra tokens. Stay under 200K.
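The tier logic is easy to encode. A sketch using the Sonnet 4.6 rates from the table above:

```python
SONNET_INPUT = 3.00        # $/M tokens, standard tier
SONNET_INPUT_LONG = 6.00   # $/M tokens, applied to ALL tokens once over 200K
THRESHOLD = 200_000

def input_cost(tokens: int) -> float:
    """All-or-nothing: crossing the threshold reprices the whole request."""
    rate = SONNET_INPUT_LONG if tokens > THRESHOLD else SONNET_INPUT
    return tokens * rate / 1_000_000

print(f"${input_cost(199_000):.2f}")  # $0.60
print(f"${input_cost(201_000):.2f}")  # $1.21
```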
The Math on Compression ROI
Consider a coding agent session using Claude Sonnet 4.6 that accumulates 250K input tokens. Without compression, the entire request hits the 2x tier: 250K x $6.00/M = $1.50 per request. With Morph Compact reducing tokens by 50%, the input drops to 125K tokens at the standard rate: 125K x $3.00/M = $0.38. That's a 75% cost reduction, not just from fewer tokens but from avoiding the surcharge tier entirely.
Cost comparison: with and without compression
```python
# Claude Sonnet 4.6 coding agent session, 250K input tokens

# WITHOUT compression: entire request hits the 2x surcharge tier (>200K)
input_cost = 250_000 * 6.00 / 1_000_000    # $1.50
output_cost = 4_000 * 22.50 / 1_000_000    # $0.09
total = input_cost + output_cost           # $1.59 per request

# WITH Morph Compact (50% reduction): 125K tokens stays under the 200K threshold
input_cost_compacted = 125_000 * 3.00 / 1_000_000   # $0.38
output_cost_compacted = 4_000 * 15.00 / 1_000_000   # $0.06
total_compacted = input_cost_compacted + output_cost_compacted  # $0.44 per request

# Savings: ~$1.15 per request (~72% reduction)
# Over 100 agent sessions/day: ~$115/day, ~$3,450/month
```

The Lost-in-the-Middle Problem
Even if you stay within the token limit, long contexts degrade model performance. The "lost-in-the-middle" phenomenon, documented across every major model family, shows that LLMs attend well to the beginning and end of their input but lose accuracy for information positioned in the center.
This means a model with a 200K context window does not give you 200K tokens of reliable working memory. Research consistently shows degradation starting well before the stated limit. A 200K model starts showing measurable quality loss around 130K tokens. The tokens in the middle of a long prompt are the most likely to be missed or misinterpreted.
For coding agents, this is especially problematic. An agent accumulates context over many turns: file reads, grep results, tool outputs, error traces, and prior conversation. The critical piece of information from ten turns ago might be sitting in the exact middle of the context window, right where the model is least likely to attend to it.
Context rot compounds over turns
As context rot research shows, model performance degrades as input length increases, even when the window is not full. Every irrelevant token makes the model worse at attending to the tokens that matter. The solution is not a bigger context window. It's keeping the context clean.
This is why context compression matters even when you have space left in the window. Compression is not just about fitting more in. It's about removing noise so the model can focus on signal. A 100K context with high information density outperforms a 200K context diluted with irrelevant tool outputs.
Practical Impact on Agent Architectures
The lost-in-the-middle effect has direct architectural consequences. If your agent reads 20 files into context across a multi-step task, the files read in the middle of the session are the ones most likely to be forgotten. This creates a pattern where agents succeed on the first and last steps of a task but fail on intermediate steps that depend on mid-context information.
Several mitigation strategies exist beyond compression. Placing critical information at the beginning or end of the prompt helps. Re-inserting important context at decision points forces the model to attend to it. And reducing total context length shifts all content closer to the attention-favored positions at the edges.
Mitigating lost-in-the-middle in agent loops
```typescript
// Anti-pattern: dump everything into context and hope for the best
// const messages = [systemPrompt, ...allToolOutputs, userQuery];

// Better: compact tool outputs and re-insert critical context
const compactedOutputs = await Promise.all(
  toolOutputs.map((output) =>
    output.tokens > 500 ? morph.compact(output.content) : output.content
  )
);

// Place the current task description near the end (attention-favored position)
const messages = [
  systemPrompt,
  ...compactedOutputs,
  {
    role: "user",
    content: `Current task: ${taskDescription}\n\nRelevant context re-stated: ${criticalContext}`,
  },
];
```

Strategies for Working Within Token Limits
Six practical approaches for staying within token limits without losing the information your application needs.
1. Chunking
Split large documents into smaller chunks that fit within the context window. Process each chunk independently, then combine results. Works well for summarization and extraction tasks where each chunk is self-contained. Falls apart when the answer depends on information spread across multiple chunks.
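A minimal greedy chunker along these lines (word count stands in for a real tokenizer; production code would count with the model's tokenizer and respect section boundaries):

```python
def chunk_text(text, max_tokens, count=lambda t: len(t.split())):
    """Greedily pack paragraphs into chunks that stay under the budget."""
    chunks, current, used = [], [], 0
    for para in text.split("\n\n"):
        cost = count(para)
        if current and used + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "alpha beta\n\ngamma delta epsilon\n\nzeta"
print(chunk_text(doc, max_tokens=3))  # ['alpha beta', 'gamma delta epsilon', 'zeta']
```

Note the splitting is on paragraph boundaries, so a single paragraph larger than the budget still becomes its own oversized chunk; a real pipeline would recursively split such paragraphs.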
2. RAG (Retrieval-Augmented Generation)
Instead of stuffing the entire knowledge base into the context window, embed documents into a vector store and retrieve only the relevant chunks at query time. This keeps token usage proportional to the query, not the corpus size. The quality ceiling depends on the retrieval step: if the right chunks are not retrieved, the model cannot use them.
3. Sliding Window / Rolling Context
For chat applications, drop the oldest messages when the conversation exceeds the token limit. The model always sees the most recent context. The tradeoff: early instructions and context are lost. Partially mitigated by keeping the system prompt pinned and summarizing dropped messages.
4. Summarization
Use the LLM itself to summarize prior context into a shorter form. Anthropic's Claude Code and OpenAI's Codex both use this approach for long coding sessions. The risk: summarization rewrites the original text, and rewriting introduces the possibility of altered code, mangled file paths, or hallucinated details.
5. Verbatim Compaction
Delete low-signal tokens while keeping every surviving sentence word-for-word identical to the original. Morph Compact achieves 50-70% token reduction at 3,300+ tokens per second with 98% verbatim accuracy. Unlike summarization, nothing is rewritten, so there is zero hallucination risk in the compacted output. The tradeoff: lower compression ratio than summarization (50-70% vs 70-90%).
6. Hybrid Approaches
Combine strategies for different parts of the input. Use RAG for the knowledge base. Compact tool outputs inline. Summarize only the oldest turns of conversation history where exact details matter less. The most effective production systems layer multiple strategies rather than relying on a single approach.
Hybrid approach: RAG + inline compaction
```typescript
// 1. RAG: retrieve only relevant documents (not the whole corpus)
const relevantDocs = await vectorStore.query(userQuery, { topK: 5 });

// 2. Compact long tool outputs inline
const toolResults = await Promise.all(
  pendingTools.map(async (tool) => {
    const result = await executeTool(tool);
    if (estimateTokens(result.output) > 500) {
      return await morph.compact(result.output); // verbatim compaction
    }
    return result.output;
  })
);

// 3. Summarize oldest conversation turns (exact details less critical)
const recentTurns = conversation.slice(-10); // keep recent turns verbatim
const oldSummary = await summarize(conversation.slice(0, -10));

// 4. Assemble context within token budget
const messages = [
  { role: "system", content: systemPrompt },
  { role: "assistant", content: oldSummary }, // summarized old context
  ...recentTurns,                             // exact recent context
  { role: "user", content: relevantDocs.join("\n") }, // RAG results
  ...toolResults.map((r) => ({ role: "tool", content: r })),
];
```

| Strategy | Token Reduction | Information Loss | Hallucination Risk |
|---|---|---|---|
| Chunking | High (per-chunk) | Context between chunks | None |
| RAG | Very high | Depends on retrieval quality | None |
| Sliding window | Variable | Oldest context dropped | None |
| Summarization | 70-90% | Details rewritten | Medium |
| Verbatim compaction | 50-70% | Low-signal tokens removed | Zero |
| Hybrid | Highest | Minimized | Depends on mix |
Token Counting Tools
You need to know your token count before sending a request, not after you get a 400 error. These tools let you count tokens locally for accurate estimation.
Python
Python token counting libraries
```python
# tiktoken: OpenAI's official tokenizer (fastest)
# pip install tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-5")
tokens = enc.encode("Your text here")
print(f"{len(tokens)} tokens")

# token-counter: multi-provider support
# pip install token-counter
from token_counter import TokenCounter

counter = TokenCounter(model="claude-sonnet-4-6")
count = counter.count("Your text here")

# HuggingFace transformers: Llama, Mistral, open models
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout")
tokens = tokenizer.encode("Your text here")
print(f"{len(tokens)} tokens")
```

JavaScript / TypeScript
JavaScript token counting libraries
```javascript
// js-tiktoken: tiktoken port for JS (works in browser + Node)
import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-5");
const tokens = enc.encode("Your text here");
console.log(`${tokens.length} tokens`);

// @xenova/transformers: Llama, Mistral, open models in JS
import { AutoTokenizer } from "@xenova/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout");
const { input_ids } = await tokenizer("Your text here");
console.log(`${input_ids.length} tokens`);
```

Online Tools
For quick checks without code: OpenAI's Tokenizer visualizes token boundaries for GPT models. It shows exactly where the tokenizer splits your text, which is useful for understanding why some inputs use more tokens than expected.
Pre-request Token Counting in Production
In production systems, count tokens before sending the API request. This lets you truncate, compress, or split the request proactively rather than handling errors reactively. Most SDKs provide token counting methods.
Pre-request token budget management
```python
import tiktoken

MODEL = "gpt-5"
MAX_CONTEXT = 400_000
MAX_OUTPUT = 4_096  # reserve for response
INPUT_BUDGET = MAX_CONTEXT - MAX_OUTPUT

enc = tiktoken.encoding_for_model(MODEL)

def count_messages_tokens(messages):
    """Count tokens across all messages including overhead."""
    total = 0
    for msg in messages:
        total += 4  # message framing overhead
        total += len(enc.encode(msg["content"]))
    total += 2  # assistant reply priming
    return total

# Check before sending
token_count = count_messages_tokens(messages)
if token_count > INPUT_BUDGET:
    # Compact instead of truncating
    messages = compress_context(messages, target=INPUT_BUDGET)

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    max_tokens=MAX_OUTPUT
)
```

Count tokens for the right model
Token counts vary between models because they use different tokenizers. Text that's 1,000 tokens on GPT-5 (o200k_base) might be 1,200 tokens on Claude or 900 tokens on Gemini. Always count using the tokenizer for the model you're actually calling.
Frequently Asked Questions
What is the largest LLM context window available in 2026?
Llama 4 Scout from Meta has the largest context window at 10 million tokens. Gemini 3 Pro from Google supports 2 million tokens. GPT-5 and GPT-5.2 from OpenAI support 400K tokens. Larger context windows do not automatically mean better performance. Models degrade well before their stated limits, and providers like Anthropic and Google apply 2x pricing surcharges above 200K tokens.
How many tokens is 1,000 words?
In English, 1,000 words is roughly 1,300 to 1,500 tokens using modern BPE tokenizers. One token averages about 4 characters or 0.7 words. Code tokenizes less efficiently at 1.5 to 2.0 tokens per word. CJK languages consume 2 to 8 times more tokens than English for equivalent content.
What happens when you exceed an LLM's token limit?
OpenAI returns an HTTP 400 error with the exact token count. Anthropic returns a validation error with token details. Neither provider silently truncates your input. You need to reduce the input length before retrying, either by truncating, using a rolling window, or compressing the context.
Why do LLMs perform worse with longer contexts?
Models exhibit the "lost-in-the-middle" problem: 30%+ accuracy drop for information positioned in the middle of long contexts. The model attends well to the beginning and end but struggles with content buried in the center. Additionally, attention quality degrades as input length increases. Models claiming 200K context windows often degrade noticeably around 130K tokens.
Do any LLM providers charge more for long contexts?
Yes. Anthropic charges 2x input and 1.5x output above 200K tokens, applied to all tokens in the request, not just the overflow. Google applies a similar 2x surcharge above 200K with the same all-or-nothing pricing. OpenAI does not have surcharge tiers. Staying under 200K is a significant cost optimization, and Morph Compact can keep you there with 50-70% token reduction.
How can I reduce token usage without losing information?
Morph Compact reduces token count by 50-70% through verbatim compaction: it deletes low-signal tokens while keeping every surviving sentence word-for-word identical to the original. Unlike summarization, there is zero hallucination risk. It runs at 3,300+ tokens per second with 98% verbatim accuracy. Other strategies include chunking, RAG, and rolling windows, but these discard information rather than compressing it. See the LLM cost optimization guide for detailed cost strategies.
Stay Under the Token Limit Without Losing Context
Morph Compact reduces token count by 50-70% through verbatim compaction. No summarization, no hallucination risk. Every surviving sentence is word-for-word identical to the original. Stay under Anthropic and Google's 200K surcharge threshold.