An LLM context window is the maximum number of tokens a model can process in one request. It is the model's working memory: your prompt, system instructions, conversation history, tool outputs, and the model's own response all share the same budget. Context windows have grown 20,000x since 2018, from 512 tokens to 10 million. But bigger does not mean better, and the number on a model card is not the number you actually get.
What Is a Context Window
A context window defines how much text an LLM can see at once. Every piece of information the model processes in a single request must fit within this budget: your prompt, the system instructions that define the model's behavior, any conversation history from previous turns, files or documents you include, tool call results, and the model's generated response.
The window is measured in tokens, not words or characters. When you send a request to an LLM API, the total token count of input plus output cannot exceed the context window. If it does, the request fails or the input gets truncated.
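The budget check described above can be sketched in a few lines. Exact counts require the target model's tokenizer; the heuristic here is only the rough ~4 characters per token rule for English prose:

```python
# Rough token estimate: ~4 characters per token for English prose.
# Real counts require the target model's tokenizer; this is a heuristic sketch.

def estimate_tokens(text: str) -> int:
    """Approximate token count using the ~4 chars/token rule of thumb."""
    return max(1, len(text) // 4)

def fits_in_window(prompt: str, max_output: int, context_window: int) -> bool:
    """Check that estimated input plus reserved output stays under the window."""
    return estimate_tokens(prompt) + max_output <= context_window

prompt = "Summarize the attached design document. " * 100
print(estimate_tokens(prompt))
print(fits_in_window(prompt, max_output=4096, context_window=8192))
```

A real pipeline would swap `estimate_tokens` for the model's actual tokenizer, since the heuristic can be off by 20%+ on code or non-English text.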
A critical point: the context window is not persistent memory. Every API call starts from scratch. The model does not "remember" previous requests. If you want continuity across turns in a chat, you must re-send the entire conversation history with each request. This is why long sessions accumulate tokens fast and why context compression becomes essential for production systems.
Context window vs memory
A 200K-token context window does not mean the model has 200K tokens of memory. It means the model can process 200K tokens right now, in this single request. Next request, it starts empty again. Conversation history, system prompts, and tool outputs must be re-sent every time.
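The statelessness above is easy to see in code. A minimal sketch of a chat loop, where `send_to_llm` is a hypothetical stand-in for a real API call, shows why the full history goes out on every turn:

```python
# Sketch of a stateless chat loop: the entire message history is re-sent
# on every turn, so token usage grows with each exchange.
# `send_to_llm` is a hypothetical placeholder for a real API client call.

def send_to_llm(messages):
    # Placeholder: a real client would POST `messages` to the model API.
    return f"(reply based on {len(messages)} messages)"

history = [{"role": "system", "content": "You are a helpful assistant."}]

for user_msg in ["What is a token?", "How big is a context window?"]:
    history.append({"role": "user", "content": user_msg})
    reply = send_to_llm(history)  # the whole history goes out each time
    history.append({"role": "assistant", "content": reply})

print(len(history))  # system prompt + two user/assistant pairs = 5 messages
```

Every turn re-transmits the system prompt and all prior messages, which is exactly why long sessions accumulate tokens fast.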
How Tokenization Works
LLMs never see raw text. Before processing, text is converted into tokens by a tokenizer. Most modern models use Byte-Pair Encoding (BPE), an algorithm that starts with individual characters and iteratively merges the most frequent adjacent pairs until reaching a target vocabulary size, typically 30K to 100K tokens.
The result: common words like "the" become a single token, while rare or compound words get split into multiple tokens. "tokenization" might become ["token", "ization"]. Code tokenizes less efficiently than prose because special characters ({ } => //), indentation, and camelCase variable names produce more tokens per character.
| Context Window | ~Words | ~Pages of Text | ~Lines of Code |
|---|---|---|---|
| 4K tokens | 3,000 | ~10 pages | ~2,000-3,000 |
| 32K tokens | 24,000 | ~80 pages | ~16,000-22,000 |
| 128K tokens | 96,000 | ~300 pages | ~50,000-70,000 |
| 1M tokens | 750,000 | ~2,500 pages | ~400,000-550,000 |
| 10M tokens | 7,500,000 | ~25,000 pages | ~4-5.5 million |
Different models use different tokenizers, so the same text produces different token counts depending on the model. OpenAI's tiktoken, Anthropic's tokenizer, and Google's SentencePiece each split text differently. Always count tokens with the target model's tokenizer, not a generic estimate.
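The core BPE merge step is simple enough to sketch in a few lines. This toy version (character-level, no byte handling or special tokens, nothing like a production tokenizer) shows how frequent adjacent pairs get merged into larger units:

```python
# Toy BPE sketch: repeatedly merge the most frequent adjacent token pair.
# Real tokenizers operate on bytes, handle a learned merge table, and use
# vocabularies of 30K-100K entries; this only illustrates the merge loop.
from collections import Counter

def most_frequent_pair(tokens):
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])  # fuse the pair
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low low lower lowest")   # start from individual characters
for _ in range(4):                      # four merge rounds
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

After four rounds, the repeated stem "low" has fused into single tokens while the rarer suffixes "-er" and "-est" remain split, mirroring how real BPE vocabularies treat common versus rare words.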
Context Window Sizes in 2026
Context windows have grown 20,000x in eight years. GPT-1 launched with 512 tokens in 2018. By early 2023, most models still operated between 4K and 8K; GPT-3.5 Turbo later pushed to 16K. Then the race accelerated: 128K (GPT-4 Turbo), 200K (Claude 3), 1M (Gemini 1.5 Pro), and now 10M (Llama 4 Scout).
| Year | Model | Context Window | Increase |
|---|---|---|---|
| 2018 | GPT-1 | 512 tokens | Baseline |
| 2020 | GPT-3 | 2,048 tokens | 4x |
| 2023 (Mar) | GPT-4 | 8,192 tokens | 16x |
| 2023 (Nov) | GPT-4 Turbo | 128K tokens | 250x |
| 2024 (Feb) | Gemini 1.5 Pro | 1M tokens | 1,950x |
| 2024 (Mar) | Claude 3 | 200K tokens | 390x |
| 2025 (Apr) | Llama 4 Scout | 10M tokens | 19,500x |
For a full breakdown of every model's context window, pricing, and max output tokens, see our LLM Context Window Comparison table.
Budget Tier
DeepSeek V3 (128K), GPT-4.1 Mini (1M), Gemini 2.5 Flash (1M). Best token-per-dollar ratio for high-volume workloads.
Premium Tier
Claude Opus 4.6 (200K, 1M beta), GPT-5.2 (400K), Gemini 2.5 Pro (1M). Strongest reasoning, highest per-token cost.
Maximum Context
Llama 4 Scout (10M), Grok 4 (2M). Largest windows available, but effective context is far smaller than advertised.
Input vs Output: The Shared Budget
The context window is a shared budget. Input tokens (what you send) and output tokens (what the model generates) both count against the same limit. Most models set separate caps for each.
| Model | Total Context | Max Output | Effective Input Limit |
|---|---|---|---|
| GPT-5.2 | 400K | 128K | 272K |
| Claude Opus 4.6 | 200K (1M beta) | 64K | 136K (936K beta) |
| Gemini 2.5 Flash | 1M | 8K | 992K |
| GPT-4.1 | 1M | 32K | 968K |
| DeepSeek R1 | 128K | 64K | 64K |
This distinction matters for different workloads. A model with 1M context but 8K max output (Gemini 2.5 Flash) can ingest an entire codebase but generates short responses. A model with 400K context and 128K max output (GPT-5.2) generates longer responses, which matters for multi-file code edits or long-form content.
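The shared-budget arithmetic above is worth making concrete. A small sketch, using the GPT-5.2 and Gemini 2.5 Flash figures from the table (treat them as illustrative, not a live rate sheet):

```python
# Effective input limit = total context window minus max output tokens.
# Figures mirror the comparison table above; treat them as illustrative.

MODELS = {
    "GPT-5.2":          {"context": 400_000,   "max_output": 128_000},
    "Gemini 2.5 Flash": {"context": 1_000_000, "max_output": 8_000},
}

for name, m in MODELS.items():
    effective_input = m["context"] - m["max_output"]
    print(f"{name}: up to {effective_input:,} input tokens per request")
```

If you always reserve the full max output, a 400K model gives you 272K of input headroom; reserving less output buys back input space at the cost of possibly truncated responses.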
Why max output matters for coding agents
A coding agent doing a multi-file refactor might need to output 500+ lines of code across several files in one turn. If the model's max output is 8K tokens (~200-300 lines), it must split the work across multiple turns, each carrying the full conversation history. More turns means more accumulated context means more context rot.
The Lost-in-the-Middle Problem
Having a large context window means nothing if the model cannot use it uniformly. Liu et al.'s research (Stanford, published in TACL 2024) demonstrated that LLM performance follows a U-shaped curve across the context. Models attend strongly to the beginning and end of the input but drop 30%+ in accuracy on information placed in the middle.
| Position | Accuracy | What Happens |
|---|---|---|
| Start (Position 1) | ~75% | Primacy bias: strong attention to early tokens |
| Middle (Position 10) | ~55% | Blind spot: model loses track of middle content |
| End (Position 20) | ~72% | Recency bias: strong attention to late tokens |
The root cause is architectural. Rotary Position Embedding (RoPE), used by most modern LLMs, introduces a long-term decay effect that naturally de-emphasizes middle positions. This is not a bug that will be patched. It is a structural property of the attention mechanism.
For practical use, this means: if your most important information lands in the middle third of a long prompt, the model is significantly less likely to use it correctly. This is especially damaging for coding agents, where the relevant file is often discovered mid-search and sits in the middle of accumulated context. Read more in our deep dive on the lost-in-the-middle effect.
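One practical mitigation the U-shaped curve suggests is ordering retrieved content so the highest-relevance items land at the start and end of the prompt, pushing the least relevant into the middle. A minimal sketch, assuming you already have documents sorted by relevance score:

```python
# Sketch: interleave documents so the most relevant land at the start
# and end of the prompt, pushing low-relevance content into the middle,
# where models attend worst. Input is sorted most-relevant-first.

def order_for_attention(docs_by_relevance):
    """Place items alternately at the front and back of the prompt."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]  # back half reversed to keep best at the edge

docs = ["doc_A (best)", "doc_B", "doc_C", "doc_D", "doc_E (worst)"]
print(order_for_attention(docs))
```

The best document opens the prompt, the second best closes it, and the weakest material sits in the blind spot, where a miss costs the least.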
Context Window Overhead
The advertised context window is not all yours. Several sources of overhead consume tokens before your actual content:
Where your context budget actually goes
Total context window: 200,000 tokens
─────────────────────────────────────────────────────
System prompt (behavior, tools, format): -3,000 tokens
Conversation history (15 turns): -25,000 tokens
Tool call results (file reads, searches): -12,000 tokens
Reserve for model output: -64,000 tokens
─────────────────────────────────────────────────────
Available for your actual question: 96,000 tokens

System prompts define the model's behavior, available tools, output formatting, and constraints. In production, these run 2,000-5,000 tokens. They repeat with every API call in a multi-turn conversation.
Conversation history grows linearly with each turn. By turn 15 of a chat, you could have 25,000-40,000 tokens of history before the user asks anything new.
Tool outputs from coding agents are the biggest consumer. Reading files, running searches, and executing commands inject thousands of tokens per action. A single file read might add 500-2,000 tokens. Ten file reads across a debugging session adds 5,000-20,000 tokens that stay in the window permanently.
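One common defense against this accumulation is an eviction policy: when tool output exceeds a budget, drop the oldest results first. A hedged sketch (token costs use the rough chars/4 heuristic, not a real tokenizer):

```python
# Sketch of a simple eviction policy: when accumulated tool output exceeds
# a token budget, keep the newest results and drop the oldest. Token counts
# use a rough chars/4 heuristic; a real agent would use the model tokenizer.

def trim_tool_outputs(outputs, budget_tokens):
    """Keep the newest tool outputs that fit within `budget_tokens`."""
    kept, used = [], 0
    for out in reversed(outputs):        # walk newest-first
        cost = max(1, len(out) // 4)
        if used + cost > budget_tokens:
            break                        # everything older is evicted too
        kept.append(out)
        used += cost
    return kept[::-1]                    # restore chronological order

outputs = ["old search result " * 50, "recent file read " * 50, "latest diff " * 10]
print(len(trim_tool_outputs(outputs, budget_tokens=300)))
```

Newest-first eviction is the crudest workable policy; relevance-based selection (keeping the file the agent is actually editing regardless of age) does better but needs a scoring signal.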
Why Bigger Windows Don't Solve the Problem
The intuitive response to context limitations is: make the window bigger. The research says this does not work.
Performance Degrades at Every Length
Chroma's research tested 18 frontier models and found that every model degrades as context grows. Not just near the limit. At every increment. A model with 1M tokens of capacity still shows context rot at 50K.
RULER Benchmark Reality Check
The RULER benchmark tests retrieval accuracy at increasing context lengths. The gap between advertised context and effective context is often enormous:
| Model | Claimed Context | Score @ 4K | Score @ 128K | Drop |
|---|---|---|---|---|
| Gemini 1.5 Pro | 1M | 96.7 | 94.4 | -2.3 pts |
| GPT-4-1106 | 128K | 96.6 | 81.2 | -15.4 pts |
| Llama 3.1-70B | 128K | 96.5 | 66.6 | -29.9 pts |
| Mixtral-8x22B | 64K | 95.6 | 31.7 | -63.9 pts |
Mixtral-8x22B, despite advertising a 64K window, produces near-random results at 128K. You are not using the same model at every context length. The model at 128K is measurably worse than the model at 4K. For more data, see our full context window comparison.
Compression Beats Expansion
CompLLM research demonstrated that 2x compressed context surpasses uncompressed performance on long sequences. The mechanism is simple: removing noise improves signal-to-noise ratio. The retrieve-then-solve approach improved Mistral from 35.5% to 66.7% accuracy by selecting relevant context instead of sending everything. Less input, better output.
Context Windows for Coding Agents
Coding agents stress context windows harder than any other use case. A typical agentic coding session accumulates context like sediment:
Context accumulation in a coding agent session
Turn 1: Read issue, system prompt → 3,500 tokens
Turn 2: Grep for relevant files, read 4 matches → +8,000 tokens
Turn 3: Need more context, read 3 related files → +6,000 tokens
Turn 4: Backtrack, read test files for patterns → +5,000 tokens
Turn 5: Found the right file, but now carrying → 22,500 tokens
↑ 80% of this is irrelevant search debris

Cognition measured this directly: agents spend over 60% of their first turn retrieving context, not reasoning or editing. An OpenReview study found that some agents consume 10x more tokens than others on equivalent tasks. The variance was driven by search efficiency, not coding ability.
The longer a coding session runs, the worse it gets. Research on long-running agents shows that every agent's success rate decreases after 35 minutes, and doubling task duration quadruples the failure rate. The cause is accumulated context noise.
For more on how the Claude Code context window handles this, and how context engineering applies to agentic workflows, see our agentic context engineering guide.
Managing Context in Production
The solution to context window limitations is not bigger windows. It is better context management. Three strategies work:
Context Compression
Reduce token count by 50-70% before sending to the model. Morph Compact keeps every surviving sentence verbatim at 98% accuracy and 3,300+ tok/s. Less noise means better output.
Context Isolation
Delegate search to subagents running in their own context windows. The main model never sees exploration dead-ends. Anthropic's multi-agent system improved performance by 90% using this approach.
Selective Retrieval
Send only relevant code to the model, not entire files. WarpGrep returns precise file and line ranges, reducing context rot by 70% while speeding up task completion by 40%.
Hidden Pricing Cliff at 200K Tokens
Beyond quality, context management has a direct cost impact. Both Anthropic and Google charge 2x on input tokens when any part of a request exceeds 200K tokens. This surcharge applies to every token in the request, not just the overage. Crossing 200K by one token doubles your input cost for the entire request.
OpenAI charges no surcharge at any length. For workloads that consistently exceed 200K tokens, this is a meaningful cost difference. At 1 billion tokens per month, the gap between DeepSeek V3 ($140) and Claude Opus 4.6 ($5,000+) is 35x before surcharges. With surcharges on long-context requests, Opus can reach $10,000. See our full pricing breakdown.
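The cliff behavior described above is easy to model. A sketch with a hypothetical rate (the $3.00 per million and 2x multiplier here are illustrative, not any vendor's actual price list):

```python
# Illustrative model of the 200K surcharge cliff: once a request's input
# exceeds the threshold, ALL input tokens bill at the surcharged rate.
# The base price and multiplier are hypothetical, not a real rate card.

def input_cost(tokens, price_per_m=3.00, threshold=200_000, surcharge=2.0):
    """Dollar cost of input tokens, doubling the rate past the threshold."""
    rate = price_per_m * (surcharge if tokens > threshold else 1.0)
    return tokens / 1_000_000 * rate

print(input_cost(200_000))  # at the threshold: 0.60
print(input_cost(200_001))  # one token over: ~1.20, double for the whole request
```

Because the surcharge applies to every token and not just the overage, a compaction pass that keeps requests under the threshold can halve input cost outright, independent of any quality benefit.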
Morph Compact for Context Reduction
Morph Compact reduces context by 50-70% while preserving every surviving sentence word-for-word. No paraphrasing, no summarization, no hallucination risk. At 3,300+ tokens per second with 98% verbatim accuracy, it pays for itself by cutting the token cost on every subsequent request in a session. For teams spending $3,000-5,000 per billion tokens on premium models, compacting context first can cut that to $1,000-2,500 while improving output quality.
Frequently Asked Questions
What is an LLM context window?
An LLM context window is the maximum number of tokens a model can process in a single request. It functions as working memory: your prompt, system instructions, conversation history, and the model's response all share the same budget. As of 2026, context windows range from 128K tokens (DeepSeek, Mistral) to 10 million (Llama 4 Scout).
How many words fit in a context window?
One token is roughly 0.75 English words, or about 4 characters. A 128K token context window holds about 96,000 words (300 pages of text) or 50,000-70,000 lines of code. A 1 million token window holds roughly 750,000 words. Code tokenizes less efficiently than prose because of special characters and indentation.
What is the difference between context window and max output tokens?
The context window is the total token budget shared between input and output. Max output tokens is the ceiling on how much the model can generate. GPT-5.2 has a 400K context window with 128K max output, meaning input is capped at 272K tokens. A model with large context but small max output can read a lot but writes short responses per turn.
Why do LLMs perform worse with longer context?
Three factors cause degradation. The lost-in-the-middle effect means models attend poorly to information in the middle of the input (30%+ accuracy drop). Attention dilution means the number of pairwise token relationships grows quadratically with context length, spreading attention thinner. And semantically similar distractors interfere with retrieval. Chroma tested 18 frontier models and every one degraded.
What is the lost-in-the-middle problem?
Research by Liu et al. (Stanford, TACL 2024) showed that LLM accuracy follows a U-shaped curve. Models attend strongly to tokens at the beginning and end but drop 30%+ on information in the middle of the context. For coding agents, this means relevant code found mid-search may sit in the model's blind spot.
Is a bigger context window always better?
No. Performance degrades at every context length increment, not just near the limit. RULER benchmarks show GPT-4-1106 drops 15 points from 4K to 128K. CompLLM research showed 2x compressed context surpasses uncompressed performance on long sequences because removing noise improves signal quality.
How do tokens work in LLMs?
Tokens are sub-word units created by a tokenizer, typically using Byte-Pair Encoding (BPE). BPE starts with individual characters and merges the most frequent adjacent pairs until reaching a vocabulary of 30K-100K tokens. Different models use different tokenizers, so the same text produces different token counts. Always count tokens with the target model's specific tokenizer.
How can I reduce context window usage without losing quality?
Morph Compact reduces context by 50-70% while keeping every surviving sentence verbatim (98% accuracy, 3,300+ tok/s). Research shows compressed context can improve output quality by reducing noise. Subagent architectures that isolate search into separate context windows reduce context rot by 70%.
Use Less Context. Get Better Results.
Morph Compact reduces context by 50-70% while keeping every surviving sentence verbatim. Cut your token costs, stay under pricing surcharge thresholds, and improve output quality at the same time.