What Is Few-Shot Prompting
Adding few-shot examples to GPT-4o code generation prompts pushed pass@1 from 47.1% to 57.5% on the CodePromptEval benchmark (7,072 prompts across three models). That is a 22% relative improvement from changing nothing about the model, the task, or the infrastructure. Just examples.
Few-shot prompting means including 2-5 input-output demonstrations in your prompt before the actual request. The model uses these demonstrations to infer the pattern you want: the format, the reasoning style, the conventions. Brown et al. coined the term in their 2020 paper "Language Models are Few-Shot Learners," showing that GPT-3 (175B parameters) could match fine-tuned models on dozens of NLP benchmarks by simply prepending examples to the prompt.
The mechanism is called in-context learning. The model does not update its weights. It conditions its next-token predictions on the examples you provided, effectively "learning" the task at inference time. This is why example quality matters more than example quantity. Three well-chosen demonstrations that cover the input space outperform ten redundant ones.
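The assembly step is mechanical and worth seeing concretely. A minimal sketch in TypeScript of building a few-shot prompt from demonstration pairs (the `Example` shape and helper name are illustrative, not from any library):

```typescript
// Hypothetical shape for a single demonstration pair.
interface Example {
  input: string;
  output: string;
}

// Assemble a few-shot prompt: demonstrations first, then the real task.
// The model conditions on these pairs at inference time; no weights change.
function buildFewShotPrompt(examples: Example[], task: string): string {
  const demos = examples
    .map((ex, i) => `Example ${i + 1}:\nInput: ${ex.input}\nOutput: ${ex.output}`)
    .join("\n\n");
  return `${demos}\n\nNow complete this task:\nInput: ${task}\nOutput:`;
}
```

Ending the prompt at `Output:` is deliberate: the model's continuation of that final, incomplete pair is the answer.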
Zero-Shot vs One-Shot vs Few-Shot
The terminology maps directly to how many examples you put in the prompt. Zero means none, one means one, few means 2-5 (sometimes up to 10). Each step up trades context window space for accuracy.
| Approach | Examples | Accuracy Pattern |
|---|---|---|
| Zero-shot | 0 examples. Instructions only. | GPT-3 TriviaQA: 64.3%. Relies entirely on pretraining knowledge and instruction-following. |
| One-shot | 1 example. Minimal pattern. | GPT-3 TriviaQA: 68.0%. The single largest jump per example. Establishes format and expectation. |
| Few-shot | 2-5 examples. Full pattern coverage. | GPT-3 TriviaQA: 71.2%. Covers edge cases and reduces ambiguity. Diminishing returns after 3-5. |
| Many-shot | 100+ examples. Requires large context. | Gemini 1.5 Pro code gen: 42% → 62% (up to 8,192 shots). Requires 1M+ token context windows. |
The jump from zero-shot to one-shot is almost always the largest per-example gain. On GPT-3 TriviaQA, that single example added 3.7 percentage points. The next few examples added 3.2 points total. After 5 examples, each additional one typically adds fractions of a point.
When Zero-Shot Beats Few-Shot
Few-shot is not universally better. On complex reasoning tasks, plain input-output examples can mislead the model if they do not precisely match the reasoning pattern needed. Chain-of-thought prompting (where you show the model your reasoning steps, not just input-output pairs) outperforms pure few-shot on multi-step math and logic problems. For code generation, few-shot works well because code follows predictable structural patterns.
Why Few-Shot Prompting Matters for Code Generation
General NLP tasks use few-shot prompting to teach the model what to do. Code generation tasks use it to teach the model how your team does it. The distinction matters.
A coding agent already knows how to write a Python function. It knows the syntax, common libraries, and standard patterns. What it does not know is that your team uses snake_case for private methods but camelCase for API handlers. It does not know you wrap all database calls in a try/except with a custom DatabaseError. It does not know your React components return early on loading states before rendering.
Few-shot examples encode these conventions implicitly. Show the model three functions from your codebase, and it infers the patterns without you having to write explicit rules for each one.
Convention Inheritance
Show 3 examples of your error handling pattern. The model generates the 4th function with the same try/catch structure, the same error types, the same logging format. No style guide needed.
Architecture Alignment
Include examples that use your abstractions (your ORM helpers, your state management, your API client). The model uses the same imports and method signatures instead of inventing its own.
Test Pattern Matching
Show 2-3 test files that follow your team's testing conventions (describe/it blocks, custom matchers, fixture patterns). Generated tests match the structure your CI expects.
Code Review Compliance
Examples that pass your code review standards teach the model those standards. The result: fewer review rounds, because the generated code already looks like team-written code.
The Coding Agent Context Problem
Coding agents like Claude Code, Aider, and Cursor face a specific version of this problem. They operate over entire repositories, but their context windows hold a fraction of the codebase. The files they choose to include serve as implicit few-shot examples. An agent that reads your auth.ts, database.ts, and api.ts before writing billing.ts generates code that matches those files' patterns. An agent that reads only the task description generates generic code. Context selection is few-shot prompting by another name.
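That selection step can be sketched as ordinary ranking code. A toy heuristic, assuming keyword overlap as the relevance signal (real agents use embeddings or repository structure; all names here are illustrative):

```typescript
// Toy sketch: context selection as implicit few-shot example selection.
// Score a file by how many task keywords appear in its text.
function scoreRelevance(taskWords: string[], fileText: string): number {
  const text = fileText.toLowerCase();
  return taskWords.filter(w => text.includes(w.toLowerCase())).length;
}

// Pick the `limit` most relevant files to include in the agent's context.
function pickContextFiles(
  task: string,
  files: Map<string, string>, // path -> file contents
  limit: number
): string[] {
  const words = task.split(/\s+/).filter(w => w.length > 3);
  return [...files.entries()]
    .map(([path, text]) => ({ path, score: scoreRelevance(words, text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(f => f.path);
}
```

Whatever the scoring signal, the output is the same: a ranked shortlist of files that will act as the model's demonstrations.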
3 Practical Few-Shot Examples for Code Generation
Theory is useful. Concrete examples that you can copy and adapt are more useful. Here are three few-shot patterns that measurably improve code generation quality for the most common tasks.
Example 1: Generating Functions That Match Your Codebase Style
The goal: generate a new utility function that follows the same conventions as your existing utilities. Instead of describing the conventions in prose (which is error-prone and verbose), show the model what they look like.
Few-shot prompt: codebase style matching

```
Here are examples of utility functions in our codebase:

// Example 1: Input validation
export function validateEmail(input: string): Result<string> {
  if (!input || !input.includes("@")) {
    return { ok: false, error: "Invalid email format" };
  }
  return { ok: true, value: input.trim().toLowerCase() };
}

// Example 2: Data transformation
export function parseApiResponse<T>(raw: unknown): Result<T> {
  if (!raw || typeof raw !== "object") {
    return { ok: false, error: "Invalid response shape" };
  }
  return { ok: true, value: raw as T };
}

// Example 3: Resource lookup
export function findUserById(users: User[], id: string): Result<User> {
  const user = users.find(u => u.id === id);
  if (!user) {
    return { ok: false, error: `User ${id} not found` };
  }
  return { ok: true, value: user };
}

Now write a utility function: parseDateRange that takes
a string like "2024-01-01..2024-12-31" and returns
a Result<{ start: Date; end: Date }>.
```

The model sees three things from these examples: (1) all functions return Result<T>, (2) error cases are handled first with early returns, (3) the error message format uses template literals with specific context. The generated parseDateRange will follow all three patterns without you specifying any of them.
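For reference, a hand-written parseDateRange that follows all three patterns might look like this. It is an illustration of the target output, not actual model output, and the Result type is restated so the sketch is self-contained:

```typescript
// Restated here for self-containment; in the codebase above this
// discriminated union would live in a shared module.
type Result<T> = { ok: true; value: T } | { ok: false; error: string };

export function parseDateRange(input: string): Result<{ start: Date; end: Date }> {
  // Error cases first, with early returns, matching the examples.
  const parts = input.split("..");
  if (parts.length !== 2) {
    return { ok: false, error: `Invalid range format: ${input}` };
  }
  const start = new Date(parts[0]);
  const end = new Date(parts[1]);
  if (isNaN(start.getTime()) || isNaN(end.getTime())) {
    return { ok: false, error: `Invalid date in range: ${input}` };
  }
  return { ok: true, value: { start, end } };
}
```

Note how all three inferred conventions show up: the Result return type, error-first early returns, and template-literal error messages carrying the bad input.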
Example 2: Teaching Test Conventions
Test generation is where few-shot prompting has the highest ROI. Without examples, models generate generic tests using whatever framework they default to. With 2-3 examples from your test suite, they match your exact setup.
Few-shot prompt: test generation

```
Here are tests from our codebase:

describe("validateEmail", () => {
  it("returns ok for valid email", () => {
    const result = validateEmail("user@example.com");
    expect(result).toEqual({ ok: true, value: "user@example.com" });
  });

  it("returns error for missing @", () => {
    const result = validateEmail("invalid");
    expect(result).toEqual({ ok: false, error: "Invalid email format" });
  });

  it("trims and lowercases", () => {
    const result = validateEmail(" User@Example.COM ");
    expect(result).toEqual({ ok: true, value: "user@example.com" });
  });
});

describe("findUserById", () => {
  const users = [testUser({ id: "1" }), testUser({ id: "2" })];

  it("returns user when found", () => {
    const result = findUserById(users, "1");
    expect(result.ok).toBe(true);
  });

  it("returns error for missing user", () => {
    const result = findUserById(users, "999");
    expect(result).toEqual({ ok: false, error: "User 999 not found" });
  });
});

Write tests for parseDateRange following the same patterns.
```

The examples show the model: use describe/it blocks (not test()), use toEqual for object comparisons, and include a positive case, a negative case, and an edge case. The model also picks up that your tests use a testUser factory helper, which tells it your codebase has test factories it should use.
Example 3: Code Review with Specific Standards
Few-shot code review prompts teach the model what your team actually flags in PRs, not generic best practices from documentation.
Few-shot prompt: code review

```
Review code using these standards. Here are examples:

// Code: const data = await fetch(url).then(r => r.json())
// Review: Missing error handling. Wrap in try/catch,
// check r.ok before parsing. Our convention: throw
// ApiError with status code and endpoint.

// Code: function process(items) { ... }
// Review: Missing TypeScript types. Parameter 'items'
// needs explicit type. Return type should be declared.
// Our convention: no implicit any, even in internal code.

// Code: if (user.role == "admin") { ... }
// Review: Use strict equality (===). Loose equality is
// banned in our ESLint config. Also: prefer
// user.role === Role.ADMIN using the Role enum.

Now review this code:

async function getUsers() {
  const res = await fetch("/api/users")
  const data = res.json()
  return data.filter(u => u.active)
}
```

Without examples, the model gives generic advice: "consider adding error handling." With examples, it gives specific advice in your team's voice: "Wrap in try/catch, check res.ok before parsing, throw ApiError with status code and endpoint." It also catches the missing await on res.json() and flags the missing type annotation on the return value.
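Applying those review comments by hand yields something like the following. The User and ApiError types are illustrative, mirroring the convention named in the review examples:

```typescript
// Illustrative types; ApiError mirrors the convention the reviews reference.
interface User {
  id: string;
  active: boolean;
}

class ApiError extends Error {
  constructor(public status: number, public endpoint: string) {
    super(`API error ${status} at ${endpoint}`);
  }
}

async function getUsers(): Promise<User[]> {
  const endpoint = "/api/users";
  const res = await fetch(endpoint);
  // Check res.ok before parsing, per the first review example.
  if (!res.ok) {
    throw new ApiError(res.status, endpoint);
  }
  const data = (await res.json()) as User[]; // await was missing in the original
  return data.filter(u => u.active);
}
```

Every change traces back to a specific review example: the ok check and ApiError from the first, the explicit return type from the second, and the added await that the model flags on its own.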
How Many Shots Is Optimal
The research gives a clear answer with an important caveat. Three to five examples is optimal for most tasks. But the optimal number depends on the specific combination of model, task, and example quality.
| Examples | Typical Impact | Token Cost |
|---|---|---|
| 0 (zero-shot) | Baseline. Model relies on pretraining only. | 0 extra tokens |
| 1 (one-shot) | Largest per-example gain. Establishes format and expectations. | ~50-200 tokens |
| 3 (few-shot) | Covers primary patterns and 1-2 edge cases. Sweet spot for most tasks. | ~150-600 tokens |
| 5 (few-shot) | Marginal gains over 3. Worth it for complex or ambiguous tasks. | ~250-1000 tokens |
| 10+ | Diminishing returns. Can degrade performance if examples are redundant. | ~500-2000+ tokens |
A key finding from the "Does Few-Shot Learning Help LLM Performance in Code Synthesis?" study (December 2024): strategically selected examples improved CodeLlama pass@1 by 5.7 percentage points, while random examples sometimes hurt performance. The selection method matters as much as the count.
The Over-Prompting Trap
Recent research on "The Few-shot Dilemma" found that incorporating excessive domain-specific examples can paradoxically degrade performance in certain LLMs. This contradicts the intuition that more examples always help. The mechanism: redundant examples consume attention budget without adding new information, and the model starts overfitting to surface-level patterns in the examples rather than learning the underlying task structure. Test with 3 examples first. Only add more if you can measure the improvement.
Google DeepMind: Many-Shot Results
Google DeepMind's 2024 many-shot study (NeurIPS Spotlight) tested Gemini 1.5 Pro with up to 8,192 examples using its 1M-token context window. Code generation scaled from 42% to 62% accuracy. Math problem-solving improved by 35%. These gains are real but require context windows that most production systems cannot afford to dedicate entirely to examples. For practical deployments with 4K-128K context windows, few-shot (3-5 examples) remains the right tradeoff.
Common Mistakes in Few-Shot Prompting
The research documents three systematic biases that degrade few-shot performance. These are not theoretical concerns. They show up in production systems and cause subtle, hard-to-debug failures.
Majority Label Bias
If 4 of your 5 examples show 'approved' code and 1 shows 'rejected' code, the model biases toward approving. Balance your example categories. If you're teaching the model to classify, include equal examples of each class.
Recency Bias
LLMs over-weight the last example in the sequence. If your last example shows error handling with try/catch, the model is more likely to use try/catch even when a Result type would be more appropriate. Randomize example order across runs.
Common Token Bias
Models default to frequent tokens over rare but correct ones. If your examples all use 'string' types, the model may default to 'string' even when a more specific type (like a branded type or enum) is correct. Include at least one example with the rare pattern.
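The order-randomization mitigation for recency bias can be sketched directly: a Fisher-Yates shuffle over the demonstrations, using a small seeded PRNG (mulberry32) only so the sketch stays deterministic; in production you would reseed per request:

```typescript
// mulberry32: a tiny seeded PRNG, used here so test runs are reproducible.
function mulberry32(seed: number): () => number {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle: no example is systematically last, so the
// model's over-weighting of the final demonstration averages out.
function shuffleExamples<T>(examples: T[], seed: number): T[] {
  const rand = mulberry32(seed);
  const out = [...examples];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```

Shuffling per request spreads the last-position advantage evenly across your examples instead of handing it to whichever one you happened to paste last.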
Structural Mistakes
Beyond the three biases, practitioners make four structural mistakes that waste tokens and degrade quality:
1. Inconsistent example format. If your first example uses comments to explain the input and your second uses a JSON block, the model does not know which format to follow. Keep the structure of every example identical: same delimiters, same ordering, same level of detail.
2. Irrelevant examples. Examples should be semantically close to the actual task. If you are generating a database query function, showing examples of UI component rendering teaches the model nothing useful and wastes context. The CodeExemplar study found that relevance-based selection outperformed random selection by 5.7 percentage points on pass@1.
3. Too many similar examples. Five examples of the same pattern (e.g., five CRUD functions) teach the model one thing five times. Three diverse examples (a CRUD function, an aggregation function, a validation function) teach it three things. Diversity in examples is more valuable than volume.
4. Examples without edge cases. If all your examples show the happy path, the model learns to ignore error states. Always include at least one example that demonstrates how you handle the failure case, the empty input, or the boundary condition.
Few-Shot Prompting + Context Compression
The tension in few-shot prompting: every example you add improves accuracy but consumes context window space. In a coding agent processing a 50-file repository, 5 few-shot examples might cost 800 tokens. That is 800 tokens not available for source code, documentation, or conversation history.
Context compression changes this equation. Research on compression priorities found that different prompt components tolerate different levels of compression:
| Component | Recommended Retention | Why |
|---|---|---|
| Instructions & constraints | 80-90% | Core logic must be preserved verbatim. Compressing instructions risks changing the task. |
| Few-shot examples | 80-90% | Structural patterns and conventions must survive compression. The model needs to see the format. |
| Retrieved context (docs, code) | 20-40% | Redundant boilerplate, import blocks, and whitespace compress well without information loss. |
Morph Compact applies this hierarchy automatically. When compressing a coding agent's context, it identifies few-shot examples by their structural pattern (input-output pairs, consistent formatting) and preserves them at high fidelity. Boilerplate code, repeated import blocks, and redundant comments get compressed aggressively. The result: 60% fewer tokens with few-shot examples intact.
Without Compression
5 few-shot examples + 30 source files = 45,000 tokens. Fits in a 128K window, but leaves limited space for conversation history. Context rot sets in after 10-15 exchanges as important examples get pushed out.
With Morph Compact
Same 5 examples + 30 files = 18,000 tokens after compression. Examples preserved at 85% fidelity. Source code compressed 65%. Room for 3x more conversation history before context limits hit.
This is the practical unlock. Without compression, teams limit themselves to 1-2 examples to save context space. With compression, they can include 5 high-quality examples and still have room for the codebase context that makes the agent useful. Few-shot accuracy gains without the token cost.
Research Citations and Key Studies
The claims in this guide are grounded in peer-reviewed research. Here are the specific studies behind each number.
Foundational Papers
Brown et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020. Trained GPT-3 (175B parameters) and demonstrated that few-shot prompting could match fine-tuned models on dozens of benchmarks. TriviaQA results: zero-shot 64.3%, one-shot 68.0%, few-shot 71.2%. Established that few-shot performance scales with model size faster than zero-shot performance. arxiv.org/abs/2005.14165
Min et al. (2022). "Rethinking the Role of Demonstrations." Found that the label space and input distribution of few-shot examples matter more than whether the labels are correct. Random labels from the true distribution outperform no labels. This means format consistency is more important than example accuracy. arxiv.org/abs/2202.12837
Code Generation Studies
Corso et al. (2024). "The Impact of Prompt Programming on Function-Level Code Generation." CodePromptEval dataset: 7,072 prompts tested across GPT-4o (52.2% pass@1), Llama3-70B (50.5%), Mistral-22B (46.9%). Best technique combination (few-shot + function signature) achieved 57.5% pass@1 on GPT-4o. arxiv.org/abs/2412.20545
Wu & Chan (2024). "Does Few-Shot Learning Help LLM Performance in Code Synthesis?" Developed CodeExemplar-Free and CodeExemplar-Based methods for selecting few-shot examples. Both improved CodeLlama pass@1 by 5.7 percentage points on HumanEval+. Confirmed that example selection strategy matters as much as example count. arxiv.org/abs/2412.02906
Scaling and Compression
Agarwal et al. (2024). "Many-Shot In-Context Learning." Google DeepMind. NeurIPS 2024 Spotlight. Tested Gemini 1.5 Pro with up to 8,192 examples (1M-token context). Code generation scaled from 42% to 62%. Math (MATH dataset) improved 35%. Introduced Reinforced ICL using model-generated rationales. arxiv.org/abs/2404.11018
Zhao et al. (2021). "Calibrate Before Use." Documented majority label bias, recency bias, and common token bias in few-shot prompting. Proposed contextual calibration to mitigate these biases. arxiv.org/abs/2102.09690
Frequently Asked Questions
What is few-shot prompting?
Few-shot prompting is providing 2-5 input-output examples in your prompt before the actual task. The LLM uses these examples to infer the pattern, format, and conventions you want. Brown et al. (2020) showed GPT-3 improved from 64.3% to 71.2% accuracy on TriviaQA using few-shot prompting versus zero-shot, without any model retraining or fine-tuning.
How many few-shot examples should I use?
Start with 3. Research consistently shows diminishing returns after 3-5 examples. Three well-chosen, diverse examples outperform ten redundant ones. The CodeExemplar study found that example selection strategy improved pass@1 by 5.7 points, while simply adding more random examples had inconsistent effects. If you are using a model with a large context window (128K+), you can experiment with more, but test and measure.
What is the difference between zero-shot, one-shot, and few-shot prompting?
Zero-shot: no examples, instructions only. One-shot: one input-output example. Few-shot: 2-5 examples. On GPT-3 TriviaQA benchmarks, zero-shot scored 64.3%, one-shot scored 68.0%, and few-shot scored 71.2%. The biggest per-example improvement comes from the first example. Each additional example adds smaller gains. Beyond 5, gains are marginal unless you have a very large context window and use Google DeepMind's many-shot approach (tested up to 8,192 examples).
Does few-shot prompting work for code generation?
Yes, and it works differently than for general NLP tasks. In code generation, few-shot examples teach the model your codebase's conventions: naming patterns, error handling, abstractions, test structure. The CodePromptEval study (7,072 prompts, 3 models) found that few-shot + function signature was the most effective combination, pushing GPT-4o from 47.1% to 57.5% pass@1. For coding agents specifically, the files selected for context serve as implicit few-shot examples.
What are the biggest mistakes with few-shot prompting?
Three documented biases: majority label bias (model favors the most common pattern in your examples), recency bias (model over-weights the last example), and common token bias (model defaults to frequent tokens). Mitigate by balancing example types, randomizing order, and including diverse edge cases. The most common structural mistake is including redundant examples instead of diverse ones. Five CRUD examples teach one pattern five times. Three diverse examples teach three patterns.
How does context compression affect few-shot examples?
Few-shot examples should be compressed conservatively (retaining 80-90% of content) while retrieved context like documentation and source code can be compressed more aggressively (20-40% retention). Morph Compact handles this automatically, achieving 60% overall compression while preserving the structural patterns in few-shot examples. This lets coding agents include 5 high-quality examples without sacrificing context space for source code.
Related Resources
Preserve Your Few-Shot Examples, Cut Your Token Costs
Morph Compact compresses coding agent context by 60% while preserving few-shot examples at high fidelity. Same accuracy, fraction of the tokens.