Semantic Code Search: How It Works, Where It Fails, and What Actually Ships

Semantic code search finds code by meaning instead of keywords. Here's how embeddings, AST parsing, and vector search work under the hood, where they break down, and why the best systems pair semantic retrieval with plain grep.

February 28, 2026 · 5 min read

Semantic code search finds code by meaning instead of exact text matches. You type "where do we validate JWT tokens" and get back the middleware function that checks token expiration, even if the word "validate" never appears in the source. The promise is compelling. The reality is more nuanced than most tool vendors admit.

12.5% average accuracy gain from semantic search (Cursor)
42.3% exact match for GrepRAG on CrossCodeEval
17x faster: grep vs. graph-based retrieval
70% context-rot reduction with RL-trained search

How Semantic Code Search Works

Semantic code search has three stages: chunking, embedding, and retrieval. Each stage introduces tradeoffs that determine whether the system actually helps or just adds latency.

1. AST-Based Chunking

The best chunkers use Abstract Syntax Trees from parsers like tree-sitter to split code on function boundaries, class definitions, and method signatures. This preserves semantic integrity: a function is indexed as one unit, not split across arbitrary 200-line blocks. If the parser can't identify logical split points, the chunker falls back to splitting on line boundaries while respecting maximum chunk sizes.
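A minimal sketch of boundary-aware chunking. Tree-sitter does this across languages; for a runnable example the sketch below uses Python's built-in `ast` module as a stand-in, and the `chunk_source` name and fallback size are illustrative, not any tool's actual API:

```python
import ast

def chunk_source(source: str, max_lines: int = 200) -> list[str]:
    """Split Python source on function/class boundaries; fall back to
    fixed-size line blocks when no logical split points are found."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    if not chunks:
        # Fallback: arbitrary line blocks, respecting the maximum chunk size
        chunks = ["\n".join(lines[i:i + max_lines])
                  for i in range(0, len(lines), max_lines)]
    return chunks

src = "def add(a, b):\n    return a + b\n\nclass Greeter:\n    def hi(self):\n        return 'hi'\n"
print(len(chunk_source(src)))  # one chunk per top-level definition
```

Each top-level function or class becomes one chunk, so the indexed unit matches the unit a developer would actually want retrieved.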

2. Vector Embeddings

Each chunk becomes a numerical vector, typically 512 to 1,536 dimensions, that captures its semantic meaning. Similar code maps to nearby points in vector space. Cursor trained a custom embedding model using agent session traces as training data: an LLM ranks which code segments would have been most helpful at each step, and the embedding model learns to align similarity scores with those rankings.
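A toy illustration of the geometry involved. The four-dimensional vectors and the names (`validate_jwt`, `render_navbar`, etc.) are invented for the example; a real model emits hundreds of dimensions, but "similar meaning, nearby vectors" works the same way:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical 4-dimensional "embeddings" (real models emit 512-1,536 dims)
validate_jwt  = [0.9, 0.8, 0.1, 0.0]   # JWT-checking middleware
check_expiry  = [0.8, 0.9, 0.2, 0.1]   # similar meaning, different wording
render_navbar = [0.0, 0.1, 0.9, 0.8]   # unrelated UI code

query = [0.85, 0.85, 0.1, 0.05]        # "where do we validate JWT tokens"
print(cosine(query, validate_jwt) > cosine(query, render_navbar))  # True
```

The query lands near both token-checking functions even though they share no keywords, which is exactly the behavior keyword search cannot provide.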

3. Nearest-Neighbor Retrieval

When you search, your query gets embedded the same way and compared against stored vectors using k-nearest neighbor search. Results are filtered by project, file path, and authorization, returning the top 10 most similar chunks by default. The actual source code stays on your machine; only embeddings and metadata go to the vector database.
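A brute-force sketch of the retrieval step. Production systems use approximate-nearest-neighbor indexes rather than a linear scan, but the contract is the same: top-k chunks by similarity, post-filtered by scope. The index layout and metadata fields here are hypothetical:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def knn(query_vec, index, k=10, project=None):
    """Brute-force k-NN over an in-memory index of (embedding, metadata)
    pairs. Note what the index holds: vectors and metadata only -- the
    source code itself never leaves the machine."""
    hits = [(cosine(query_vec, emb), meta) for emb, meta in index
            if project is None or meta["project"] == project]  # scope filter
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:k]

index = [
    ([0.9, 0.1], {"project": "api",  "path": "auth/jwt.py",   "lines": "12-48"}),
    ([0.1, 0.9], {"project": "api",  "path": "ui/navbar.tsx", "lines": "1-30"}),
    ([0.8, 0.2], {"project": "docs", "path": "auth.md",       "lines": "1-20"}),
]
top = knn([1.0, 0.0], index, k=2, project="api")
print([meta["path"] for _, meta in top])  # ['auth/jwt.py', 'ui/navbar.tsx']
```

The `docs` chunk is excluded by the project filter before ranking, mirroring how real systems apply authorization and path filters on top of raw similarity.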

AST Chunking

Tree-sitter parses code into functions, classes, and methods. Each chunk is a semantically meaningful unit, not an arbitrary text block.

Vector Embedding

Each chunk becomes a 512-1536 dimensional vector. Models like Cursor's are trained on real agent search patterns, not generic text similarity.

KNN Retrieval

Your query is embedded and compared against all stored vectors. Results ranked by cosine similarity, filtered by project scope.

The Performance Data

The debate between grep and semantic search has actual numbers behind it now. Here is what the benchmarks say.

| System | Python EM (CrossCodeEval) | Retrieval Latency |
|---|---|---|
| GrepRAG (ripgrep) | 42.29% | 0.018s |
| RLCoder (RL-based) | 39.46% | ~3.0s |
| RepoFuse (structure) | 38.62% | ~3.0s |
| VanillaRAG (BM25) | 24.99% | ~3.0s |
| GraphCoder (graph) | 19.44% | 6.9s |

Cursor's internal benchmarks tell a different but complementary story. Adding their custom semantic search to the existing grep-based agent improved accuracy by 12.5% on average across all tested models (6.5% to 23.5% depending on the model). On large codebases with 1,000+ files, users with semantic search showed 2.6% higher code retention. And disabling semantic search increased dissatisfied follow-up requests by 2.2%.

Both datasets measure different things

GrepRAG measures code completion accuracy on repository-level tasks. Cursor measures agent accuracy on open-ended user queries. GrepRAG shows grep can beat semantic methods on structured retrieval. Cursor shows semantic search adds value on fuzzy, real-world queries. The two findings are not contradictory.

Augment's Context Engine showed the largest gains: 80% improvement on Claude Code + Opus 4.5, 71% on Cursor + Claude Opus 4.5, and 30% on Cursor + Composer-1 on SWE-Bench tasks. These numbers reflect what happens when semantic indexing gives an agent a conceptual map of the codebase before it starts searching.

Where Semantic Search Fails

The Precision Problem

Code embeddings have a fundamental mismatch between semantic similarity and functional correctness. Changing < to <= barely moves the embedding vector but can flip every test in the suite. Two implementations with completely different variable names and control flow sit far apart in vector space but return identical results. The embedding captures what the code looks like, not what it does.
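The mismatch is easy to demonstrate. The sketch below uses `difflib`'s character-level ratio as a crude stand-in for embedding similarity: a one-character edit leaves the text nearly identical while flipping the behavior at the boundary:

```python
import difflib

lt  = "def in_range(x, hi):\n    return x < hi\n"
lte = "def in_range(x, hi):\n    return x <= hi\n"

# Surface similarity is near 1.0 -- an embedding sees almost the same text
ratio = difflib.SequenceMatcher(None, lt, lte).ratio()
print(round(ratio, 3))

# ...but behavior at the boundary value flips
ns = {}
exec(lt, ns)
f_lt = ns["in_range"]
ns = {}
exec(lte, ns)
f_lte = ns["in_range"]
print(f_lt(10, 10), f_lte(10, 10))  # False True
```

Any similarity measure built on the code's surface form, embeddings included, will score these two functions as near-duplicates despite their divergent behavior.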

The Scale Ceiling

Google DeepMind showed that 512-dimensional embeddings break down around 500K documents. At scale, the vector space becomes crowded: too many code chunks occupy similar regions, and nearest-neighbor search returns increasingly noisy results. BM25 keyword search outperformed neural embedding models on their benchmark at this scale.

The Staleness Problem

Embeddings need to be recomputed when code changes. In fast-moving codebases, the index falls behind. A function that was renamed yesterday still has its old embedding in the vector store. Claude Code's Boris Cherny cited staleness as a key reason they dropped their local vector database. Agentic grep always searches the current state of the code.

The Distractor Problem

When you search for "webhook handler," semantic search returns everything that is semantically close: the handler, the test fixtures, the deprecated implementation, the similarly-named utility function. Chroma's context rot research showed that these semantically similar but factually irrelevant distractors are the worst kind of noise for an LLM. They cause more hallucinations than random irrelevant text.

Where Grep Fails

The Unknown Identifier Problem

Grep requires knowing what to search for. "Where do we handle rate limiting?" returns nothing if the function is called throttleRequests or applyBackpressure. The GrepRAG paper showed that the LLM generating grep queries partially solves this by expanding keywords, but 28.5% of Python retrieval failures and 24.9% of Java failures were still recall failures: the right keywords simply never appeared in the grep commands.
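A toy reproduction of the failure mode, using a regex search over an in-memory "codebase" (the file names and contents are invented for the example):

```python
import re

codebase = {
    "middleware.py": "def throttle_requests(req):\n    ...  # caps req/s per client\n",
    "queue.py": "def apply_backpressure(q):\n    ...\n",
}

def grep(pattern: str) -> list[str]:
    """Minimal grep: return files whose contents match the regex."""
    return [path for path, text in codebase.items()
            if re.search(pattern, text)]

print(grep(r"rate.?limit"))            # [] -- the concept exists, the keyword doesn't
print(grep(r"throttle|backpressure"))  # both files, but only if you guess the names
```

The first search fails not because grep is slow or imprecise but because the vocabulary of the query and the vocabulary of the code never overlap; this is the recall gap that keyword expansion only partially closes.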

The Cross-File Dependency Problem

"How does the auth middleware interact with the session store?" requires understanding import chains, call graphs, and framework conventions. No single grep command captures these multi-hop relationships. The agent has to run multiple searches and piece the picture together, accumulating context with each step.

The Token Cost Problem

Agentic grep loops burn tokens. The LLM generates a search query, reads the results, decides if they are relevant, generates another query, reads more results. Each iteration adds to the context window. Research on token consumption found some agent runs consumed 10x more tokens than others on similar tasks, driven almost entirely by search efficiency.

| Query Type | Best Approach | Why |
|---|---|---|
| Known identifier | Grep | Exact match, milliseconds, zero setup |
| Fuzzy concept | Semantic | Finds code even without keyword overlap |
| Cross-file relationships | Semantic + grep | Semantic provides the map, grep reads the files |
| Large-scale refactor | Grep | Regex patterns across entire codebase |
| Legacy code exploration | Semantic | Handles inconsistent naming conventions |
| Multi-step agent task | RL-trained hybrid | Learns which tool to use and when to stop |
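The routing logic this table implies can be sketched as a dispatcher. The heuristics below are invented for illustration; production systems learn this policy rather than hard-coding it:

```python
import re

# camelCase or snake_case tokens suggest the user already knows an identifier
IDENTIFIER = re.compile(r"\b[a-z]+[A-Z]\w*\b|\b\w+_\w+\b")
RELATIONAL = ("interact", "depend", "call", "flow", "connect")

def route(query: str) -> str:
    """Toy query router (heuristics are illustrative, not a real product's)."""
    if IDENTIFIER.search(query):
        return "grep"            # known identifier: exact match wins
    if any(word in query.lower() for word in RELATIONAL):
        return "semantic+grep"   # multi-hop: semantic maps, grep reads
    return "semantic"            # fuzzy concept: meaning over keywords

print(route("find throttleRequests"))                            # grep
print(route("how does auth middleware interact with sessions"))  # semantic+grep
print(route("where do we handle rate limiting"))                 # semantic
```

The hard-coded version makes the tradeoff visible; an RL-trained router replaces these regexes with a policy learned from which tool actually produced useful context.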

The Real Problem: Context Accumulation

Better search technology helps, but it doesn't solve the deeper problem. Cognition measured that agents spend over 60% of their first turn just retrieving context. Not editing. Not reasoning. Searching. And every search result stays in the context window for the rest of the session.

Context accumulation in a typical agent task

Step 1: Agent reads issue description            →    500 tokens
Step 2: Grep + semantic search, reads 6 files    →  9,000 tokens
Step 3: Needs more context, reads 3 more files   →  6,000 tokens
Step 4: Backtracks, reads test files             →  5,000 tokens
Step 5: Found the right file, but now carries    → 20,500 tokens
         ↑ 90% of this is irrelevant noise

Whether the 20K tokens came from grep results or semantic search results, the damage to the model is the same.

This is the context rot problem. Chroma's research tested 18 frontier models and found every one degrades as input length increases, even when the context window is not close to full. Performance drops by 30%+ when relevant information sits in the middle of the context. The search method does not change this dynamic. More precise search results get there faster, but a bad search loop with either grep or semantic search creates the same context pollution.

60% of agent time spent on search, not coding
30%+ performance drop from lost-in-the-middle
10x token variance between efficient and inefficient searches
20K+ tokens of noise in a typical agent context

Frequently Asked Questions

What is semantic code search?

Semantic code search finds code by meaning rather than exact text matches. It uses vector embeddings to represent code chunks as high-dimensional numbers, then finds the closest matches to your query. You can search "where do we handle authentication" and find the relevant middleware even if the word "authentication" never appears in the source code.

How does semantic code search differ from grep?

Grep matches exact text patterns and runs in milliseconds with no index required. Semantic search matches conceptual meaning and requires pre-computed embeddings stored in a vector database. Grep is better when you know the identifier. Semantic search is better for fuzzy queries like "where is rate limiting handled" when you don't know the function is called throttleRequests. Production systems use both.

What tools offer semantic code search?

Cursor ships a custom embedding model trained on agent session traces. Augment's Context Engine provides semantic indexing as an MCP server compatible with Claude Code and Cursor. Sourcegraph offers trigram-based search at enterprise scale. CodeGrok uses tree-sitter AST parsing with ChromaDB for local vector search. GitHub code search uses a combination of keyword and semantic matching.

Is semantic code search better than grep for coding agents?

Neither alone is optimal. Cursor's research showed semantic search adds 12.5% average accuracy improvement, but only when combined with grep. The GrepRAG paper (ISSTA 2026) showed lightweight ripgrep pipelines outperforming graph-based semantic retrieval while running 17x faster. The best agent architectures use both approaches together.

How do code embeddings work?

Code is split into chunks using AST parsers like tree-sitter, which split on function boundaries and class definitions rather than arbitrary line counts. Each chunk is converted to a vector embedding, a numerical array typically of 512 to 1,536 dimensions, using an embedding model. Similar code maps to nearby points in vector space. Queries are embedded the same way and compared using nearest-neighbor search in a vector database like Turbopuffer, ChromaDB, or Pinecone.

What are the limitations of semantic code search?

Code embeddings have a precision problem: changing < to <= barely moves the embedding but can flip every test. Embeddings require index maintenance and fall behind in fast-moving codebases. Google DeepMind showed that 512-dimensional embeddings break down around 500K documents. And semantically similar but irrelevant results (test fixtures, deprecated code) are the worst kind of noise for LLMs.

What is the best approach to code search for AI coding agents?

The best approach minimizes irrelevant context reaching the reasoning model. Agents accumulate 20,000+ tokens of search noise during multi-step tasks. RL-trained search agents like WarpGrep isolate search in a separate context window, using both grep and learned retrieval strategies, then return only the relevant file and line ranges. This reduces context rot by 70% and speeds up task completion by 40%.

Search That Learns What to Keep

WarpGrep combines RL-trained search intelligence with code understanding, bridging grep speed and semantic understanding. 70% less context rot, 40% faster task completion, and every frontier model lifted to #1 on SWE-Bench Pro.