Morph: Enable Faster Coding Agents
A deep dive into how we make Morph so fast.

Posted by Tejas Bhakta
6 minute read
Why speed matters
Morph is a specialized LLM trained to be world-class at applying changes to code and files. It's similar in spirit to Cursor's Composer and Apply, but with a focus on speed and accuracy: a dedicated model whose only job is to apply edits.
We've achieved:
- 1000 tokens/s processing speed
- Consistent performance across file sizes up to 1500 lines
From an ML perspective, this is not an unsolved problem. GPT-4o and Claude Sonnet are both capable of applying edits with the right prompt and sufficient yelling.
The hard part is doing this fast, cheap, and at scale. Using frontier models to apply changes to files is not practical from a cost or latency perspective, and coding agents don't have that good product "feel" without it.
| GPU | Memory (GB) | Memory bandwidth (GB/s) | FP16 Tensor Core (TFLOP/s) |
| --- | --- | --- | --- |
| A10 | 24 | 600 | 125 |
| A100 PCIe/SXM | 80 | 1935/2039 | 312 |
| L4 | 24 | 300 | 121 |
| L40S | 48 | 864 | 362 |
| H100 PCIe/SXM | 80 | 2000/3350 | 756/989 |
| H200 SXM | 141 | 4800 | 989 |
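Memory bandwidth is the column that matters most here: during decode, every generated token has to stream the model weights (and the KV cache) through the GPU's memory system. A rough back-of-the-envelope sketch of why, with an illustrative model size and batch size that are assumptions for the example, not Morph's actual configuration:

```python
# Rough, illustrative estimate of decode throughput for a memory-bandwidth-bound model.
# Model size, batch size, and the efficiency factor are assumptions for this example.

def decode_tokens_per_second(bandwidth_gb_s: float,
                             weight_bytes_gb: float,
                             batch_size: int = 1,
                             efficiency: float = 0.6) -> float:
    """Each decode step reads the weights once and shares them across the batch,
    so throughput scales roughly with batch size until compute or KV traffic dominates."""
    steps_per_second = (bandwidth_gb_s * efficiency) / weight_bytes_gb
    return steps_per_second * batch_size

# e.g. a hypothetical model with ~7 GB of weights on an H100 SXM (3350 GB/s):
print(decode_tokens_per_second(3350, 7.0, batch_size=4))  # ~1150 tokens/s across the batch
```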
Performance Improvements
Our approach leads to significant speedups:
- 5-8x faster than traditional inference
- Near-instant results
There are 3 main things we can leverage:
- The task is very narrow and we can optimize for that
- The input and output are very similar in content, so we can use speculation to speed up inference
- The input and output length are often similar
The Problem
The naive approach to building an LLM code modification system is straightforward but inefficient:
- Load model into GPU memory
- For each request: tokenize input, run inference, generate tokens, detokenize output
- Return modified code
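Spelled out as code, that loop looks roughly like the sketch below. The model name, prompt format, and generation settings are placeholders for illustration, not what Morph actually runs:

```python
# Naive, one-request-at-a-time serving loop. Placeholder model and prompt format.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some-code-edit-model")          # placeholder name
model = AutoModelForCausalLM.from_pretrained("some-code-edit-model").cuda()

def apply_edit(original_code: str, edit_instruction: str) -> str:
    # One GPU, one request at a time: tokenize, generate, detokenize.
    prompt = f"<code>{original_code}</code>\n<edit>{edit_instruction}</edit>\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output_ids = model.generate(**inputs, max_new_tokens=4096)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```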
While this works for demos, it hits major bottlenecks:
- Poor GPU utilization with idle time between requests
- Sequential processing creates queues
- Fixed batch size misses parallelization opportunities
- Memory inefficiency from reserving full GPU per request
This typically achieves only 300-400 tokens/second at useful model sizes on an A100/H100, and throughput only scales linearly with cost, making it impractical for a production service. We need something much faster.
Optimization Approaches
We've optimized the inference pipeline for speed. This is by far the hardest part of making Morph work well. To have a free tier, we need to be able to process requests in a way that feels instant and doesn't have annoying batching latency.
Batching
- Why should we batch?
- Batching is a technique that lets us process multiple LLM inference requests in parallel, and it's a core way to maximize token throughput.
- Batching comes at the cost of latency: requests may wait for a batch to form before they start.
Optimizations
Continuous Batching with Dynamic SLAs
We've implemented an intelligent batching system that adapts to real-time request patterns. Rather than using fixed batch sizes, our system dynamically adjusts based on incoming traffic and client priorities. This means enterprise users get immediate processing while free tier requests are smartly batched for optimal throughput.
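As a rough illustration, a continuous batcher with per-tier latency targets might look like the sketch below. The tier names, deadlines, and scheduling policy are simplified assumptions, not our production scheduler:

```python
# Simplified continuous-batching scheduler with per-tier latency budgets.
# Tier deadlines and the engine interface are illustrative assumptions.
import time, heapq
from dataclasses import dataclass, field

SLA_SECONDS = {"enterprise": 0.0, "free": 0.25}   # how long a request may wait to be batched

@dataclass(order=True)
class Request:
    deadline: float
    prompt: str = field(compare=False)

class ContinuousBatcher:
    def __init__(self, engine, max_batch_size: int = 32):
        self.engine = engine                      # any engine exposing a step(requests) method
        self.queue: list[Request] = []            # min-heap ordered by deadline
        self.max_batch_size = max_batch_size

    def submit(self, prompt: str, tier: str) -> None:
        heapq.heappush(self.queue, Request(time.monotonic() + SLA_SECONDS[tier], prompt))

    def step(self) -> None:
        now = time.monotonic()
        # Launch a batch as soon as any request hits its deadline, or the batch is full.
        if self.queue and (self.queue[0].deadline <= now or len(self.queue) >= self.max_batch_size):
            batch = [heapq.heappop(self.queue)
                     for _ in range(min(self.max_batch_size, len(self.queue)))]
            self.engine.step(batch)               # new requests can join on the next iteration
```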
Speculative Decoding
We use the original code as the speculation (draft) for the decoder. Since most edits preserve significant portions of the input, this gives us a large speedup by avoiding redundant computation: the model can quickly verify unchanged sections in parallel and spend sequential decoding only on the parts that actually change.
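Conceptually, the verification loop looks like the simplified sketch below. `verify_window` is a placeholder for the engine call that scores a window of draft tokens in a single forward pass; the real implementation lives inside the inference engine and handles re-alignment after mismatches more carefully:

```python
# Simplified speculative-edit loop: propose a window of tokens from the original file,
# verify them in one forward pass, keep the accepted prefix, and take the model's own
# token at the first mismatch. `verify_window` is a placeholder for the engine call.
def speculative_edit(original_tokens, generated, verify_window, window: int = 32):
    cursor = 0
    while cursor < len(original_tokens):
        draft = original_tokens[cursor:cursor + window]
        # One forward pass scores all draft positions at once; this is the speedup,
        # since plain decoding produces only one token per pass.
        accepted, correction = verify_window(generated, draft)
        generated.extend(draft[:accepted])
        cursor += accepted
        if accepted < len(draft):          # model disagreed with the original file here
            generated.append(correction)   # take the model's token and continue
            cursor += 1
    return generated
```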
Quantization-Aware KV Cache
Memory management is crucial for scalable fast inference. Our KV cache system implements advanced quantization techniques:
KV Cache Quantization Format Analysis
We've extensively tested two key formats:
E5M2 (5 exponent bits, 2 mantissa bits)
- Dynamic range: ±57344.0
- Better suited for attention values with large dynamic ranges
- Lower precision but requires less scaling overhead
- Direct compatibility with existing CUDA kernels
E4M3FN (4 exponent bits, 3 mantissa bits)
- Dynamic range: ±448.0
- Higher precision for small/medium values
- Requires FP32 scaling factors per tensor
- Better numerical stability for attention patterns
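A minimal sketch of the scaled-quantization path using PyTorch's FP8 dtypes. Per-tensor scaling and the tensor shapes are shown for clarity only; as noted in the implementation details below, the production path scales per attention head and fuses dequantization into the attention kernel instead of materializing FP16 again:

```python
# Per-tensor FP8 quantization of a KV-cache block. Simplified, illustrative sketch.
import torch

E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448.0
E5M2_MAX = torch.finfo(torch.float8_e5m2).max     # 57344.0

def quantize_kv(block: torch.Tensor, fmt: torch.dtype = torch.float8_e4m3fn):
    max_val = E4M3_MAX if fmt == torch.float8_e4m3fn else E5M2_MAX
    scale = block.float().abs().amax().clamp(min=1e-12) / max_val   # FP32 scale per tensor
    return (block.float() / scale).to(fmt), scale

def dequantize_kv(q_block: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q_block.to(torch.float32) * scale).to(torch.float16)

kv = torch.randn(8, 1024, 128, dtype=torch.float16)   # (heads, seq_len, head_dim), example shape
q_kv, scale = quantize_kv(kv, fmt=torch.float8_e5m2)
print(q_kv.element_size(), "byte per value vs", kv.element_size(), "for FP16")
```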
Implementation Details
- Fused dequantization + attention operations are needed to keep the latency overhead minimal
- Custom CUDA kernels optimized for E5M2 format (in progress)
- Adaptive scaling based on attention head statistics
- Intelligent cache eviction using access patterns and priority queues
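As a toy illustration of the eviction idea: score cached prefixes by access count and recency, and drop the lowest-priority blocks when GPU memory runs low. The scoring policy below (fewest hits, then least recently used) is a stand-in, not our actual policy:

```python
# Toy priority-based KV-cache eviction. The policy and bookkeeping are illustrative only.
import time, heapq

class KVCacheManager:
    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.entries: dict[str, dict] = {}   # prefix hash -> {"blocks", "hits", "last_used"}

    def touch(self, key: str, blocks: int) -> None:
        entry = self.entries.setdefault(key, {"blocks": blocks, "hits": 0, "last_used": 0.0})
        entry["hits"] += 1
        entry["last_used"] = time.monotonic()

    def evict_if_needed(self) -> None:
        used = sum(e["blocks"] for e in self.entries.values())
        # Evict lowest-priority entries (fewest hits, then least recently used) until we fit.
        heap = [(e["hits"], e["last_used"], key) for key, e in self.entries.items()]
        heapq.heapify(heap)
        while used > self.capacity and heap:
            _, _, key = heapq.heappop(heap)
            used -= self.entries.pop(key)["blocks"]
```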
Performance Impact
- 4x memory reduction vs FP16
- Negligible quality degradation on our test suite
- 2.3x reduced memory bandwidth usage
- Negligible latency overhead from quantization/dequantization
There's a bit more to making this less naive that we won't go into here. Contact us if you're interested in the details. We're currently developing CUDA kernels that fuse dequantization directly into attention computation, eliminating separate dequantization passes. Early benchmarks show an additional 15-20% speedup.
Custom CUDA Kernels (In Progress)
We're working on specialized GPU operations optimized specifically for code editing. Code editing is an extremely narrow task, and we can leverage that to squeeze out performance.
- Fused attention operations that minimize memory transfers
- Batch-aware implementations that maximize GPU utilization
- Custom variants of FlashAttention tuned for our use case
A wise man once said: "OK CUDA is hard, good CUDA is harder." We're working on this.
How do you build a fast coding agent on this?
- Apply changes in the background, before the user clicks apply; batched requests cost less for business users.
- When a user wants to apply changes directly, process them immediately.
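A hedged sketch of that pattern from the agent side. The `request_apply` call and its `priority` flag are hypothetical placeholders for whatever client you use, not a documented Morph API:

```python
# Pre-apply edits speculatively in the background; escalate to an immediate,
# low-latency request only when the user actually clicks "apply".
# `request_apply` and its `priority` flag are hypothetical placeholders.
import asyncio

async def request_apply(original: str, edit: str, priority: str) -> str:
    ...  # call your apply endpoint here; "batch" tolerates queueing, "interactive" does not

class EditSession:
    def __init__(self, original: str, edit: str):
        self.original, self.edit = original, edit
        # Kick off a cheap, batch-priority apply as soon as the edit is proposed.
        self.background = asyncio.create_task(request_apply(original, edit, priority="batch"))

    async def on_user_clicks_apply(self) -> str:
        if self.background.done():
            return self.background.result()          # already computed: feels instant
        # Otherwise escalate to an interactive request instead of waiting on the batch.
        self.background.cancel()
        return await request_apply(self.original, self.edit, priority="interactive")
```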
Future Optimizations
We're working on several improvements to make Morph even faster and more accurate:
Adaptive Speculation
Our current speculative edits approach uses a fixed strategy, but we're developing a system that dynamically adjusts speculation based on the type of transformation being performed. For example:
- Type hint additions can use aggressive speculation since the changes are usually localized
- Refactoring operations may need more conservative speculation due to their complexity
- Small formatting changes can use maximum speculation for fastest possible throughput
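In code, this could be as simple as a policy table like the sketch below, where the speculation window sizes per edit type are invented numbers purely for illustration:

```python
# Illustrative adaptive-speculation policy: choose how many tokens of the original
# file to speculate per step based on the kind of edit. Window sizes are made up.
SPECULATION_WINDOW = {
    "add_type_hints": 128,   # highly localized edits: speculate aggressively
    "format_only":    256,   # almost everything is unchanged: maximum speculation
    "refactor":       16,    # structure moves around: speculate conservatively
}

def speculation_window(edit_kind: str, default: int = 32) -> int:
    return SPECULATION_WINDOW.get(edit_kind, default)
```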
Near Symmetric Inference
- This task is nearly symmetric: it's not inherently prefill-leaning (though prefill is still heavy), and input and output are usually similar in length.
- This matters because it unlocks optimizations that only pay off for near-symmetric tasks.
Performance on huge files
- We've tested up to 2000 lines with consistent performance.
- We're working on pushing this even further. The main limit here is cost of training and inference.