How Cursor Composer and Apply Work
An analysis of Cursor's breakthrough in achieving 1000 tokens per second for code edits through specialized models and inference methods.
Analyzing Cursor Composer and Apply
Cursor recently published a fascinating technical deep-dive into their code editing technology. The post details how they handle large-scale code edits at speeds of roughly 1000 tokens per second.
The Problem: Large Code Edit Challenges
According to Cursor's research, frontier models like GPT-4o and Claude struggle with large code edits in three key areas:
- Latency: Traditional token-by-token generation is too slow
- Accuracy: Models often make mistakes on complex edits
- Consistency: Multiple model calls can lead to infinite loops or inconsistent results

The hard part of applying is making the inference fast and reliable at scale.
As they demonstrate in their blog post:
"Even small, isolated edits are plagued with bugs [...] SWE-Agent attempts a simple edit seven times before giving up due to a consistent syntactic error."
Cursor's Two-Stage Solution
Cursor breaks down code editing into two distinct phases:
- Planning Phase:
  - Uses Priompt, a prompting framework for prioritizing context
  - Uses a frontier model for reasoning about changes
  - Handles the high-level understanding of what needs to change
  - Takes place in their chat interface
- Apply Phase:
  - Uses their specialized fast-apply model, a fine-tuned Llama-3-70b (Llama-3-70b-ft)
  - Focuses on executing the planned changes quickly and accurately
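Conceptually, the split might look something like the sketch below. The model interface, function names, and prompt formats here are assumptions for illustration, not Cursor's actual API:

```python
from typing import Protocol

# A minimal sketch of the plan/apply split, assuming a generic
# complete(prompt) -> str model interface. Prompts are placeholders.
class TextModel(Protocol):
    def complete(self, prompt: str) -> str: ...

def plan_edit(frontier: TextModel, file_text: str, request: str) -> str:
    """Planning phase: a frontier model reasons about the change and
    proposes a rough edit (what the chat interface produces)."""
    return frontier.complete(
        f"File:\n{file_text}\n\nRequest: {request}\n\nProposed edit:"
    )

def apply_edit(fast_apply: TextModel, file_text: str, plan: str) -> str:
    """Apply phase: a specialized model rewrites the full file to
    incorporate the planned edit as quickly as possible."""
    return fast_apply.complete(
        f"Original file:\n{file_text}\n\nEdit to apply:\n{plan}\n\nRewritten file:"
    )

def edit(file_text: str, request: str,
         frontier: TextModel, fast_apply: TextModel) -> str:
    return apply_edit(fast_apply, file_text, plan_edit(frontier, file_text, request))
```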
The Technical Implementation
Cursor's blog reveals several key technical decisions:
Full File Rewriting vs. Diffs
They chose full file rewriting over diffs for three reasons:
- Token Context: More output tokens give the model more forward passes to determine the correct solution
- Training Distribution: Models have seen more complete files than diffs during training
- Line Number Challenges: Models struggle with accurate line number counting in diffs
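To make the line-number problem concrete, here is an illustrative (hypothetical) comparison of the two output formats a model could be asked to produce:

```python
# Illustrative only: in the diff format the model must emit exact hunk
# headers like "@@ -142,7 +142,7 @@"; miscounting a single line
# corrupts the patch.
diff_style_output = """\
@@ -142,7 +142,7 @@ def total(items):
-    return sum(i.price for i in items)
+    return sum(i.price * i.qty for i in items)
"""

# Full-file rewriting sidesteps line counting entirely: the model emits
# the whole updated file, which also matches its training distribution.
full_rewrite_output = "<entire updated file contents>"
```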
Model Architecture
According to their blog post:
- Base Model: Llama-3-70b
- Performance: "~13x speedup over vanilla inference"
- Comparison: "~9x speedup over previous GPT-4 speculative edits deployment"
Speculative Edits
One of their most interesting innovations is "speculative edits", which they describe as:
"With code edits, we have a strong prior on the draft tokens at any point in time, so we can speculate on future tokens using a deterministic algorithm rather than a draft model."
This approach yields:
- 4-5x speedup over traditional methods
- Equivalent accuracy to full-file rewrites
- Significantly faster than GPT-4 with speculative decoding
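The core idea can be sketched in a few lines. The sketch below assumes a greedy decoder exposed as a single `greedy_next` function; it illustrates the technique, not Cursor's implementation:

```python
from typing import Callable, List

# Minimal sketch of speculative edits: the draft tokens come from the
# original file itself (a deterministic prior) rather than from a
# smaller draft model. greedy_next(tokens) is assumed to return, for
# each position i, the argmax next token after tokens[:i+1].
def speculative_edit(
    greedy_next: Callable[[List[int]], List[int]],
    prompt: List[int],    # instruction + original file, tokenized
    original: List[int],  # tokens of the file being rewritten (the draft)
    eos: int,
    chunk: int = 16,
) -> List[int]:
    out: List[int] = []
    cursor = 0  # position in `original` we are speculating from
    while True:
        draft = original[cursor:cursor + chunk]
        # One forward pass verifies the whole draft chunk in parallel.
        preds = greedy_next(prompt + out + draft)
        base = len(prompt) + len(out)
        n_accept = 0
        for i, tok in enumerate(draft):
            if preds[base + i - 1] != tok:
                break  # the model wants to edit here; stop accepting
            n_accept += 1
        out.extend(draft[:n_accept])
        cursor += n_accept
        if draft and n_accept == len(draft):
            continue  # entire chunk unchanged; keep speculating
        # Divergence (a real edit) or end of file: take the model's token.
        next_tok = preds[len(prompt) + len(out) - 1]
        if next_tok == eos:
            return out
        out.append(next_tok)
        # A production system would re-align `cursor` with the original
        # (e.g., resync on the next unchanged line); here verification
        # simply rejects stale draft tokens instead.
```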
Performance Metrics
Cursor's evaluation methodology is particularly rigorous:
speed = Num_Rewritten_Chars / Latency_for_Rewrite_in_seconds
Their metrics show:
- ~1000 tokens/s processing speed
- ~3500 chars/s throughput
- Consistent performance across file sizes up to 400 lines
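As a quick sanity check of how these numbers relate (assuming roughly 3.5 characters per token, which their two figures imply), the metric can be computed directly:

```python
# Small worked example of the speed metric above; values are illustrative.
def rewrite_speed(num_rewritten_chars: int, latency_s: float) -> float:
    return num_rewritten_chars / latency_s

print(rewrite_speed(17_500, 5.0))  # 3500.0 chars/s ~ 1000 tokens/s
```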
Model Training Insights
Their blog reveals interesting training decisions:
- Data Preparation (see the filtering sketch after this list):
  - Downsampled files under 100 LOC
  - Balanced training examples per filename
  - Filtered out no-op transformations
- Model Selection:
  - Tested both Deepseek Coder Instruct and Llama 3
  - Found Llama-3-70b-ft performed best
  - Outperformed GPT-4 Turbo in evaluations
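A minimal sketch of what those data-preparation filters could look like, assuming a simple before/after example schema of our own invention (not Cursor's actual pipeline):

```python
import random
from collections import Counter

# Hypothetical filters matching the three decisions described above:
# downsample very short files, cap examples per filename, drop no-ops.
def filter_examples(examples, keep_short_prob=0.2, max_per_file=5):
    counts = Counter()
    kept = []
    for ex in examples:
        if ex["before"] == ex["after"]:
            continue  # filter out no-op transformations
        if ex["before"].count("\n") < 100 and random.random() > keep_short_prob:
            continue  # downsample files under 100 LOC
        if counts[ex["filename"]] >= max_per_file:
            continue  # balance training examples per filename
        counts[ex["filename"]] += 1
        kept.append(ex)
    return kept
```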
Future Challenges
Cursor identifies several areas for improvement:
- Long Context: Working on handling files up to 2500 lines
- Model Size: Exploring distillation to Llama-3-8b
- Accuracy: Investigating on-policy RL for better performance
Implications for Morph
Cursor's research validates several key principles we've been exploring:
- The importance of specialized models for code editing
- The benefits of full-file context over diff-based approaches
- The potential of speculative decoding in code transformation
Their work demonstrates the viability of high-speed code transformation while highlighting the challenges and trade-offs involved in building such systems.
Conclusion and Key Takeaways
- Rigorous Analysis: Cursor's deep-dive is highly technical, detailing performance metrics (e.g., ~1000 tokens/s, 3500 char/s) and comparing different model architectures.
- Clear Structure: By dividing code editing into planning and apply phases, they simplify a complex problem into manageable components.
- Innovative Approach: Their use of speculative edits as a deterministic mechanism to forecast future tokens is a notable innovation that yields significant speedups.
- Transparency and Future Focus: Discussing challenges like long-context training and model distillation offers readers insight into ongoing research directions.

Overall, Cursor's technical blog sets a high standard for detailed and insightful analysis in the AI-driven code editing space.
Comparison with morph-v0
We've been working on a similar approach to Cursor's Composer and Apply, with a focus on speed and accuracy: one specialized model plans the changes, and a second specialized model applies them.
We've achieved:
- 1000 tokens/s processing speed
- Consistent performance across file sizes up to 1500 lines
Observations
- Linear scaling of RoPE position ids does not generalize well here. This task needs to be trained on a large dataset of code edits with long input and output sequence lengths (see the sketch after this list).
- For large files this is extremely memory-intensive and requires significant compute.
- Recent process reward modeling shows promise, but it would slow applies down: the reward model would need to output diffs, and the rewritten content would have to prove more useful as context for it than the original code.
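For reference, here is a minimal sketch of the linear RoPE position-id scaling mentioned in the first observation. Parameter names are ours; this is the generic position-interpolation technique, not any particular codebase's implementation:

```python
import numpy as np

# Linear RoPE scaling ("position interpolation"): position ids are
# compressed by `factor` so longer sequences fit inside the trained
# context window. In our experience this alone did not hold up for
# long apply sequences without further training on long code edits.
def rope_angles(position_ids: np.ndarray, head_dim: int,
                base: float = 10000.0, factor: float = 1.0) -> np.ndarray:
    scaled = position_ids / factor  # linear scaling of position ids
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    return np.outer(scaled, inv_freq)  # angles fed to the sin/cos rotation
```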