Semantic Code Search: How It Works, Where It Fails, and What Actually Ships

Semantic code search finds code by meaning instead of keywords. Here's how embeddings, AST parsing, and vector search work under the hood, where they break down, and why the best systems pair semantic retrieval with plain grep.

February 28, 2026 · 5 min read

Semantic code search finds code by meaning instead of exact text matches. You type "where do we validate JWT tokens" and get back the middleware function that checks token expiration, even if the word "validate" never appears in the source. The promise is compelling. The reality is more nuanced than most tool vendors admit.

12.5% average accuracy gain from semantic search (Cursor)
42.3% exact match for GrepRAG on CrossCodeEval
17x faster: grep vs. graph-based retrieval
70% context-rot reduction with RL-trained search

How Semantic Code Search Works

Semantic code search has three stages: chunking, embedding, and retrieval. Each stage introduces tradeoffs that determine whether the system actually helps or just adds latency.

1. AST-Based Chunking

The best chunkers use Abstract Syntax Trees from parsers like tree-sitter to split code on function boundaries, class definitions, and method signatures. This preserves semantic integrity: a function is indexed as one unit, not split across arbitrary 200-line blocks. If the parser can't identify logical split points, the chunker falls back to splitting on line boundaries while respecting maximum chunk sizes.
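A minimal sketch of boundary-aware chunking. Tree-sitter does this across languages; for a runnable example the sketch below uses Python's built-in `ast` module as a stand-in, and the `chunk_source` name and fallback size are illustrative, not any tool's actual API:

```python
import ast

def chunk_source(source: str, max_lines: int = 200) -> list[str]:
    """Split Python source on function/class boundaries; fall back to
    fixed-size line blocks when no logical split points are found."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    if not chunks:
        # Fallback: arbitrary line blocks, respecting the maximum chunk size
        chunks = ["\n".join(lines[i:i + max_lines])
                  for i in range(0, len(lines), max_lines)]
    return chunks

src = "def add(a, b):\n    return a + b\n\nclass Greeter:\n    def hi(self):\n        return 'hi'\n"
print(len(chunk_source(src)))  # one chunk per top-level definition
```

Each top-level function or class becomes one chunk, so the indexed unit matches the unit a developer would actually want retrieved.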

2. Vector Embeddings

Each chunk becomes a numerical vector, typically 512 to 1,536 dimensions, that captures its semantic meaning. Similar code maps to nearby points in vector space. Cursor trained a custom embedding model using agent session traces as training data: an LLM ranks which code segments would have been most helpful at each step, and the embedding model learns to align similarity scores with those rankings.
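A toy illustration of the geometry involved. The four-dimensional vectors and the names (`validate_jwt`, `render_navbar`, etc.) are invented for the example; a real model emits hundreds of dimensions, but "similar meaning, nearby vectors" works the same way:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical 4-dimensional "embeddings" (real models emit 512-1,536 dims)
validate_jwt  = [0.9, 0.8, 0.1, 0.0]   # JWT-checking middleware
check_expiry  = [0.8, 0.9, 0.2, 0.1]   # similar meaning, different wording
render_navbar = [0.0, 0.1, 0.9, 0.8]   # unrelated UI code

query = [0.85, 0.85, 0.1, 0.05]        # "where do we validate JWT tokens"
print(cosine(query, validate_jwt) > cosine(query, render_navbar))  # True
```

The query lands near both token-checking functions even though they share no keywords, which is exactly the behavior keyword search cannot provide.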

3. Nearest-Neighbor Retrieval

When you search, your query gets embedded the same way and compared against stored vectors using k-nearest neighbor search. Results are filtered by project, file path, and authorization, returning the top 10 most similar chunks by default. The actual source code stays on your machine; only embeddings and metadata go to the vector database.
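A brute-force sketch of the retrieval step. Production systems use approximate-nearest-neighbor indexes rather than a linear scan, but the contract is the same: top-k chunks by similarity, post-filtered by scope. The index layout and metadata fields here are hypothetical:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def knn(query_vec, index, k=10, project=None):
    """Brute-force k-NN over an in-memory index of (embedding, metadata)
    pairs. Note what the index holds: vectors and metadata only -- the
    source code itself never leaves the machine."""
    hits = [(cosine(query_vec, emb), meta) for emb, meta in index
            if project is None or meta["project"] == project]  # scope filter
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:k]

index = [
    ([0.9, 0.1], {"project": "api",  "path": "auth/jwt.py",   "lines": "12-48"}),
    ([0.1, 0.9], {"project": "api",  "path": "ui/navbar.tsx", "lines": "1-30"}),
    ([0.8, 0.2], {"project": "docs", "path": "auth.md",       "lines": "1-20"}),
]
top = knn([1.0, 0.0], index, k=2, project="api")
print([meta["path"] for _, meta in top])  # ['auth/jwt.py', 'ui/navbar.tsx']
```

The `docs` chunk is excluded by the project filter before ranking, mirroring how real systems apply authorization and path filters on top of raw similarity.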

AST Chunking

Tree-sitter parses code into functions, classes, and methods. Each chunk is a semantically meaningful unit, not an arbitrary text block.

Vector Embedding

Each chunk becomes a 512-1536 dimensional vector. Models like Cursor's are trained on real agent search patterns, not generic text similarity.

KNN Retrieval

Your query is embedded and compared against all stored vectors. Results ranked by cosine similarity, filtered by project scope.

The Performance Data

The debate between grep and semantic search has actual numbers behind it now. Here is what the benchmarks say.

| System | Python EM (CrossCodeEval) | Retrieval Latency |
|---|---|---|
| GrepRAG (ripgrep) | 42.29% | 0.018s |
| RLCoder (RL-based) | 39.46% | ~3.0s |
| RepoFuse (structure) | 38.62% | ~3.0s |
| VanillaRAG (BM25) | 24.99% | ~3.0s |
| GraphCoder (graph) | 19.44% | 6.9s |

Cursor's internal benchmarks tell a different but complementary story. Adding their custom semantic search to the existing grep-based agent improved accuracy by 12.5% on average across all tested models (6.5% to 23.5% depending on the model). On large codebases with 1,000+ files, users with semantic search showed 2.6% higher code retention. And disabling semantic search increased dissatisfied follow-up requests by 2.2%.

Both datasets measure different things

GrepRAG measures code completion accuracy on repository-level tasks. Cursor measures agent accuracy on open-ended user queries. GrepRAG shows grep can beat semantic methods on structured retrieval. Cursor shows semantic search adds value on fuzzy, real-world queries. The two findings are not contradictory.

Augment's Context Engine showed the largest gains: 80% improvement on Claude Code + Opus 4.5, 71% on Cursor + Claude Opus 4.5, and 30% on Cursor + Composer-1 on SWE-Bench tasks. These numbers reflect what happens when semantic indexing gives an agent a conceptual map of the codebase before it starts searching.

Where Semantic Search Fails

The Precision Problem

Code embeddings have a fundamental mismatch between semantic similarity and functional correctness. Changing < to <= barely moves the embedding vector but can flip every test in the suite. Two implementations with completely different variable names and control flow sit far apart in vector space but return identical results. The embedding captures what the code looks like, not what it does.
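The mismatch is easy to demonstrate. The sketch below uses `difflib`'s character-level ratio as a crude stand-in for embedding similarity: a one-character edit leaves the text nearly identical while flipping the behavior at the boundary:

```python
import difflib

lt  = "def in_range(x, hi):\n    return x < hi\n"
lte = "def in_range(x, hi):\n    return x <= hi\n"

# Surface similarity is near 1.0 -- an embedding sees almost the same text
ratio = difflib.SequenceMatcher(None, lt, lte).ratio()
print(round(ratio, 3))

# ...but behavior at the boundary value flips
ns = {}
exec(lt, ns)
f_lt = ns["in_range"]
ns = {}
exec(lte, ns)
f_lte = ns["in_range"]
print(f_lt(10, 10), f_lte(10, 10))  # False True
```

Any similarity measure built on the code's surface form, embeddings included, will score these two functions as near-duplicates despite their divergent behavior.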

The Scale Ceiling

Google DeepMind showed that 512-dimensional embeddings break down around 500K documents. At scale, the vector space becomes crowded: too many code chunks occupy similar regions, and nearest-neighbor search returns increasingly noisy results. BM25 keyword search outperformed neural embedding models on their benchmark at this scale.

The Staleness Problem

Embeddings need to be recomputed when code changes. In fast-moving codebases, the index falls behind. A function that was renamed yesterday still has its old embedding in the vector store. Claude Code's Boris Cherny cited staleness as a key reason they dropped their local vector database. Agentic grep always searches the current state of the code.

The Distractor Problem

When you search for "webhook handler," semantic search returns everything that is semantically close: the handler, the test fixtures, the deprecated implementation, the similarly-named utility function. Chroma's context rot research showed that these semantically similar but factually irrelevant distractors are the worst kind of noise for an LLM. They cause more hallucinations than random irrelevant text.

Where Grep Fails

The Unknown Identifier Problem

Grep requires knowing what to search for. "Where do we handle rate limiting?" returns nothing if the function is called throttleRequests or applyBackpressure. The GrepRAG paper showed that the LLM generating grep queries partially solves this by expanding keywords, but 28.5% of Python retrieval failures and 24.9% of Java failures were still recall failures: the right keywords simply never appeared in the grep commands.
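A toy reproduction of the failure mode, using a regex search over an in-memory "codebase" (the file names and contents are invented for the example):

```python
import re

codebase = {
    "middleware.py": "def throttle_requests(req):\n    ...  # caps req/s per client\n",
    "queue.py": "def apply_backpressure(q):\n    ...\n",
}

def grep(pattern: str) -> list[str]:
    """Minimal grep: return files whose contents match the regex."""
    return [path for path, text in codebase.items()
            if re.search(pattern, text)]

print(grep(r"rate.?limit"))            # [] -- the concept exists, the keyword doesn't
print(grep(r"throttle|backpressure"))  # both files, but only if you guess the names
```

The first search fails not because grep is slow or imprecise but because the vocabulary of the query and the vocabulary of the code never overlap; this is the recall gap that keyword expansion only partially closes.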

The Cross-File Dependency Problem

"How does the auth middleware interact with the session store?" requires understanding import chains, call graphs, and framework conventions. No single grep command captures these multi-hop relationships. The agent has to run multiple searches and piece the picture together, accumulating context with each step.

The Token Cost Problem

Agentic grep loops burn tokens. The LLM generates a search query, reads the results, decides if they are relevant, generates another query, reads more results. Each iteration adds to the context window. Research on token consumption found some agent runs consumed 10x more tokens than others on similar tasks, driven almost entirely by search efficiency.

| Query Type | Best Approach | Why |
|---|---|---|
| Known identifier | Grep | Exact match, milliseconds, zero setup |
| Fuzzy concept | Semantic | Finds code even without keyword overlap |
| Cross-file relationships | Semantic + grep | Semantic provides the map, grep reads the files |
| Large-scale refactor | Grep | Regex patterns across entire codebase |
| Legacy code exploration | Semantic | Handles inconsistent naming conventions |
| Multi-step agent task | RL-trained hybrid | Learns which tool to use and when to stop |
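The routing logic this table implies can be sketched as a dispatcher. The heuristics below are invented for illustration; production systems learn this policy rather than hard-coding it:

```python
import re

# camelCase or snake_case tokens suggest the user already knows an identifier
IDENTIFIER = re.compile(r"\b[a-z]+[A-Z]\w*\b|\b\w+_\w+\b")
RELATIONAL = ("interact", "depend", "call", "flow", "connect")

def route(query: str) -> str:
    """Toy query router (heuristics are illustrative, not a real product's)."""
    if IDENTIFIER.search(query):
        return "grep"            # known identifier: exact match wins
    if any(word in query.lower() for word in RELATIONAL):
        return "semantic+grep"   # multi-hop: semantic maps, grep reads
    return "semantic"            # fuzzy concept: meaning over keywords

print(route("find throttleRequests"))                            # grep
print(route("how does auth middleware interact with sessions"))  # semantic+grep
print(route("where do we handle rate limiting"))                 # semantic
```

The hard-coded version makes the tradeoff visible; an RL-trained router replaces these regexes with a policy learned from which tool actually produced useful context.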

The Real Problem: Context Accumulation

Better search technology helps, but it doesn't solve the deeper problem. Cognition measured that agents spend over 60% of their first turn just retrieving context. Not editing. Not reasoning. Searching. And every search result stays in the context window for the rest of the session.

Context accumulation in a typical agent task

Step 1: Agent reads issue description            →    500 tokens
Step 2: Grep + semantic search, reads 6 files    →  9,000 tokens
Step 3: Needs more context, reads 3 more files   →  6,000 tokens
Step 4: Backtracks, reads test files             →  5,000 tokens
Step 5: Found the right file, but now carries    → 20,500 tokens
         ↑ 90% of this is irrelevant noise

Whether the 20K tokens came from grep results or semantic search results, the damage to the model is the same.

This is the context rot problem. Chroma's research tested 18 frontier models and found every one degrades as input length increases, even when the context window is not close to full. Performance drops by 30%+ when relevant information sits in the middle of the context. The search method does not change this dynamic. More precise search results get there faster, but a bad search loop with either grep or semantic search creates the same context pollution.

60% of agent time spent on search, not coding
30%+ performance drop from lost-in-the-middle
10x token variance between efficient and inefficient searches
20K+ tokens of noise in a typical agent context

Frequently Asked Questions

What is semantic code search?

Semantic code search finds code by meaning rather than exact text matches. It uses vector embeddings to represent code chunks as high-dimensional numbers, then finds the closest matches to your query. You can search "where do we handle authentication" and find the relevant middleware even if the word "authentication" never appears in the source code.

How does semantic code search differ from grep?

Grep matches exact text patterns and runs in milliseconds with no index required. Semantic search matches conceptual meaning and requires pre-computed embeddings stored in a vector database. Grep is better when you know the identifier. Semantic search is better for fuzzy queries like "where is rate limiting handled" when you don't know the function is called throttleRequests. Production systems use both.

What tools offer semantic code search?

Cursor ships a custom embedding model trained on agent session traces. Augment's Context Engine provides semantic indexing as an MCP server compatible with Claude Code and Cursor. Sourcegraph offers trigram-based search at enterprise scale. CodeGrok uses tree-sitter AST parsing with ChromaDB for local vector search. GitHub code search uses a combination of keyword and semantic matching.

Is semantic code search better than grep for coding agents?

Neither alone is optimal. Cursor's research showed semantic search adds 12.5% average accuracy improvement, but only when combined with grep. The GrepRAG paper (ISSTA 2026) showed lightweight ripgrep pipelines outperforming graph-based semantic retrieval while running 17x faster. The best agent architectures use both approaches together.

How do code embeddings work?

Code is split into chunks using AST parsers like tree-sitter, which split on function boundaries and class definitions rather than arbitrary line counts. Each chunk is converted to a vector embedding, a numerical array typically of 512 to 1,536 dimensions, using an embedding model. Similar code maps to nearby points in vector space. Queries are embedded the same way and compared using nearest-neighbor search in a vector database like Turbopuffer, ChromaDB, or Pinecone.

What are the limitations of semantic code search?

Code embeddings have a precision problem: changing < to <= barely moves the embedding but can flip every test. Embeddings require index maintenance and fall behind in fast-moving codebases. Google DeepMind showed that 512-dimensional embeddings break down around 500K documents. And semantically similar but irrelevant results (test fixtures, deprecated code) are the worst kind of noise for LLMs.

What is the best approach to code search for AI coding agents?

The best approach minimizes irrelevant context reaching the reasoning model. Agents accumulate 20,000+ tokens of search noise during multi-step tasks. RL-trained search agents like WarpGrep isolate search in a separate context window, using both grep and learned retrieval strategies, then return only the relevant file and line ranges. This reduces context rot by 70% and speeds up task completion by 40%.

Search That Learns What to Keep

WarpGrep combines RL-trained search intelligence with code understanding, bridging grep speed and semantic understanding. 70% less context rot, 40% faster task completion, and every frontier model lifted to #1 on SWE-Bench Pro.