
Morph: Enable Faster Coding Agents

A deep dive into how we make Morph so fast.

Posted by Tejas Bhakta

6 minute read


Why speed matters

Morph is a specialized LLM trained to be world class at applying changes to code and files. We've been working on an approach similar to Cursor's Composer and Apply, but with a focus on speed and accuracy: a dedicated model whose only job is to apply the changes.

We've achieved:

  • 1000 tokens/s processing speed
  • Consistent performance across file sizes up to 1500 lines

From an ML perspective, this is not an unsolved problem. GPT-4o and Claude Sonnet are both capable of applying edits with the right prompt and sufficient yelling.

The hard part is that we need to do this fast, cheap, and at scale. Using frontier models to apply changes to files is not practical from a cost and latency perspective. Coding agents don't have that good product "feel" without this.

GPU              Memory (GB)   Memory bandwidth (GB/s)   FP16 Tensor Core (TFLOP/s)
A10              24            600                       125
A100 PCIe/SXM    80            1935/2039                 312
L4               24            300                       121
L40S             48            864                       362
H100 PCIe/SXM    80            2000/3350                 756/989
H200 SXM         141           4800                      989
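
A back-of-the-envelope way to read the table above: single-stream decode is bound by memory bandwidth, not FLOPs, because every generated token has to stream all of the model's weights from HBM once. The numbers below assume a hypothetical 7B-parameter model served in FP16; they're illustrative, not a statement about our actual model.

```python
# Rough decode-throughput ceiling at batch size 1, assuming a hypothetical
# 7B-parameter model in FP16 (2 bytes per parameter). Illustrative only.
weights_gb = 7e9 * 2 / 1e9                      # 7B params x 2 bytes = 14 GB of weights
bandwidth_gbps = {"A10": 600, "L40S": 864, "A100 SXM": 2039, "H100 SXM": 3350, "H200 SXM": 4800}
for gpu, gbps in bandwidth_gbps.items():
    ceiling = gbps / weights_gb                 # tokens/s upper bound at batch size 1
    print(f"{gpu}: ~{ceiling:.0f} tokens/s ceiling at batch size 1")
```

Batching is what lets us climb above that single-stream ceiling, which is why so much of the work below is about scheduling.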

Performance Improvements

Our approach leads to significant speedups:

  • 5-8x faster than traditional inference
  • Near-instant results

There are 3 main things we can leverage:

  • The task is very narrow and we can optimize for that
  • The input and output are very similar in content, so we can use speculation to speed up inference
  • The input and output lengths are often similar

The Problem

The naive approach to building an LLM code modification system is straightforward but inefficient:

  1. Load model into GPU memory
  2. For each request: tokenize input, run inference, generate tokens, detokenize output
  3. Return modified code
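
Concretely, the naive loop looks something like the sketch below. A generic Hugging Face-style stack stands in here; the model name and prompt format are placeholders, not our actual serving code.

```python
# Naive "apply" service: one request at a time, one full generate() per call.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "your-code-model"                        # placeholder checkpoint, not Morph's
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).to("cuda")

def apply_edit(original_code: str, edit: str) -> str:
    prompt = f"<original>{original_code}</original>\n<edit>{edit}</edit>\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # The GPU is fully reserved for this one request until generation finishes.
    output = model.generate(**inputs, max_new_tokens=4096)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```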

While this works for demos, it hits major bottlenecks:

  • Poor GPU utilization with idle time between requests
  • Sequential processing creates queues
  • Fixed batch size misses parallelization opportunities
  • Memory inefficiency from reserving full GPU per request

This typically achieves only 300-400 tokens/second at useful model sizes on an A100/H100, and throughput scales only linearly with cost, making it impractical for a production service. We need something much faster.

Optimization Approaches

We've optimized the inference pipeline for speed. This is by far the hardest part of making Morph work well. To have a free tier, we need to be able to process requests in a way that feels instant and doesn't have annoying batching latency.

Batching

  • Why batch? Batching lets us process multiple LLM inference requests in parallel, and it's a core way to maximize token throughput.
  • Batching comes at the cost of latency: early arrivals wait for the batch to fill (see the sketch below).
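
Here's a stripped-down static-batching loop to make the trade-off concrete. It reuses the `tokenizer` and `model` from the naive sketch above; the batch size and wait time are illustrative.

```python
# Static batching sketch: collect up to `batch_size` requests, then run them in one
# generate() call. Decode steps are shared across the batch (higher throughput),
# but the first request waits for the batch to fill or the timer to expire (latency).
import queue

requests: "queue.Queue[str]" = queue.Queue()

def serve_batches(batch_size: int = 8, max_wait_s: float = 0.05):
    tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
    tokenizer.padding_side = "left"               # decoder-only models pad on the left
    while True:
        batch = [requests.get()]                  # block until at least one request exists
        while len(batch) < batch_size:
            try:
                batch.append(requests.get(timeout=max_wait_s))
            except queue.Empty:
                break                             # stop waiting, run a partial batch
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")
        outputs = model.generate(**inputs, max_new_tokens=4096)
        # ...decode each row of `outputs` and route it back to the right caller
```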

Optimizations

Continuous Batching with Dynamic SLAs

We've implemented an intelligent batching system that adapts to real-time request patterns. Rather than using fixed batch sizes, our system dynamically adjusts based on incoming traffic and client priorities. This means enterprise users get immediate processing while free tier requests are smartly batched for optimal throughput.
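
A stripped-down version of the scheduling idea is below. The priority and SLA rules are illustrative, not our production policy: interactive/enterprise requests are admitted at the next decode step, free-tier requests wait until there's spare batch capacity or their deadline arrives.

```python
# Continuous batching sketch: requests join the running batch at decode-step
# boundaries instead of waiting for the current batch to finish.
import heapq
import time
from dataclasses import dataclass, field

MAX_BATCH = 32

@dataclass(order=True)
class Request:
    priority: int                        # 0 = interactive/enterprise, 1 = free tier
    deadline: float                      # latest acceptable start time (the SLA)
    prompt: str = field(compare=False)

pending: list[Request] = []              # heap ordered by (priority, deadline)

def admit(req: Request) -> None:
    heapq.heappush(pending, req)

def schedule_step(running: list[Request]) -> list[Request]:
    """Called once per decode step: top up the running batch from the pending heap."""
    now = time.monotonic()
    while pending and len(running) < MAX_BATCH:
        nxt = pending[0]
        is_urgent = nxt.priority == 0 or nxt.deadline <= now
        has_slack = len(running) < MAX_BATCH // 2
        if is_urgent or has_slack:
            running.append(heapq.heappop(pending))
        else:
            break                        # free-tier request can afford to wait
    return running
```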

Speculative Decoding

We use the original code as the speculation for the decoder. Since most edits preserve significant portions of the input, this gives us a speedup by avoiding redundant computation: the model can quickly verify unchanged sections and focus compute on the parts that need changes.
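
In sketch form it looks like the function below, assuming a Hugging Face-style causal LM, batch size 1, and greedy acceptance; the naive "resume one token later" re-alignment after a mismatch is a simplification of what a real system does.

```python
# Speculative-edit sketch: the original file itself is the draft. Each iteration runs
# one forward pass over a window of draft tokens, keeps the prefix the model agrees
# with, and falls back to the model's own token at the first disagreement.
import torch

@torch.no_grad()
def speculative_apply(model, prompt_ids: torch.Tensor, draft_ids: torch.Tensor,
                      window: int = 32) -> torch.Tensor:
    out = prompt_ids                                     # shape (1, n): prompt + accepted output
    i = 0
    while i < draft_ids.shape[1]:
        chunk = draft_ids[:, i:i + window]               # next slice of the original file
        logits = model(torch.cat([out, chunk], dim=1)).logits
        start = out.shape[1] - 1                         # logits[start] predicts chunk[0]
        preds = logits[:, start:start + chunk.shape[1]].argmax(-1)
        agree = (preds == chunk).long().cumprod(dim=1).sum().item()
        out = torch.cat([out, chunk[:, :agree]], dim=1)  # accept the agreeing prefix for free
        if agree < chunk.shape[1]:
            out = torch.cat([out, preds[:, agree:agree + 1]], dim=1)  # model's correction
            i += agree + 1                               # naive re-alignment with the draft
        else:
            i += agree
    return out
```

When the edit leaves most of the file untouched, almost every window is accepted in full, so we pay roughly one forward pass per `window` tokens instead of one per token.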

Quantization-Aware KV Cache

Memory management is crucial for fast, scalable inference. Our KV cache system implements advanced quantization techniques:

KV Cache Quantization Format Analysis

We've extensively tested two key formats:

  • E5M2 (5 exponent, 2 mantissa bits)

    • Dynamic range: ±57344.0
    • Better suited for attention values with large dynamic ranges
    • Lower precision but requires less scaling overhead
    • Direct compatibility with existing CUDA kernels
  • E4M3FN (4 exponent, 3 mantissa bits)

    • Dynamic range: ±448.0
    • Higher precision for small/medium values
    • Requires FP32 scaling factors per tensor
    • Better numerical stability for attention patterns
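
To make the format trade-off concrete, here's a minimal round-trip using PyTorch's native float8 dtypes (PyTorch 2.1+). The per-tensor scaling policy is illustrative, not our production scheme.

```python
# Comparing the two FP8 formats for KV-cache storage.
import torch

print(torch.finfo(torch.float8_e5m2).max)      # 57344.0: wide range, 2 mantissa bits
print(torch.finfo(torch.float8_e4m3fn).max)    # 448.0:   narrower range, 3 mantissa bits

def quantize_kv(kv: torch.Tensor, dtype: torch.dtype = torch.float8_e4m3fn):
    """Round-trip a KV tensor through FP8, keeping one FP32 per-tensor scale."""
    scale = (kv.abs().amax().float() / torch.finfo(dtype).max).clamp(min=1e-12)
    q = (kv.float() / scale).to(dtype)          # stored as FP8 values + one FP32 scale
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) * scale).to(torch.float16)

kv = torch.randn(8, 1024, 64, dtype=torch.float16) * 4      # fake per-head KV slice
q, scale = quantize_kv(kv)
print((dequantize_kv(q, scale) - kv).abs().max())           # worst-case quantization error
```
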
Implementation Details

  • Fused dequantization + attention operations are required for this to be useful at minimal latency
  • Custom CUDA kernels optimized for the E5M2 format (in progress)
  • Adaptive scaling based on attention head statistics
  • Intelligent cache eviction using access patterns and priority queues

Performance Impact

  • 4x memory reduction vs FP16
  • Negligible quality degradation on our test suite
  • 2.3x reduced memory bandwidth usage
  • Negligible latency overhead from quantization/dequantization

There's a bit more to making this less naive that we won't go into here. Contact us if you're interested in the details. We're currently developing CUDA kernels that fuse dequantization directly into attention computation, eliminating separate dequantization passes. Early benchmarks show an additional 15-20% speedup.

Custom CUDA Kernels (In Progress)

We're working on specialized GPU operations optimized specifically for code editing. Code editing is an extremely narrow task, and we can leverage that to squeeze out performance.

  • Fused attention operations that minimize memory transfers
  • Batch-aware implementations that maximize GPU utilization
  • Custom variants of FlashAttention tuned for our use case

A wise man once said: OK, CUDA is hard; good CUDA is harder. We're working on this.

How do you build a fast coding agent on this?

  • Apply changes in the background, before the user clicks apply; batched requests cost less for business users (sketched below).
  • When a user wants to apply changes directly, process them immediately.
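
A minimal sketch of that flow. The `request_apply(...)` call and its `priority` values are hypothetical stand-ins for whatever apply API you use; they are not a description of our client library.

```python
# Fire a cheap, batch-priority apply as soon as the model proposes an edit, and only
# pay for the low-latency path if the user clicks before the background result lands.
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor()

def request_apply(original_code: str, edit: str, priority: str) -> str:
    """Placeholder for your apply call (e.g. an HTTP request to an apply endpoint)."""
    raise NotImplementedError

def on_edit_proposed(original_code: str, edit: str) -> Future:
    # Optimistic background apply: batched, cheaper, nobody is waiting on it yet.
    return executor.submit(request_apply, original_code, edit, priority="batch")

def on_user_clicks_apply(background: Future, original_code: str, edit: str) -> str:
    if background.done():
        return background.result()                # already merged: feels instant
    background.cancel()                           # no-op if the batched request already started
    return request_apply(original_code, edit, priority="interactive")
```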

Future Optimizations

We're working on several improvements to make Morph even faster and more accurate:

Adaptive Speculation

Our current speculative edits approach uses a fixed strategy, but we're developing a system that dynamically adjusts speculation based on the type of transformation being performed. For example:

  • Type hint additions can use aggressive speculation since the changes are usually localized
  • Refactoring operations may need more conservative speculation due to their complexity
  • Small formatting changes can use maximum speculation for fastest possible throughput
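
The dispatch itself can be as simple as a lookup from edit type to speculation window. The categories and window sizes below are illustrative, not tuned production values.

```python
# Adaptive speculation sketch: choose how many original-file tokens to draft per
# verification pass based on the kind of transformation being applied.
SPECULATION_WINDOW = {
    "formatting": 128,      # output nearly identical to input: speculate aggressively
    "type_hints": 64,       # localized insertions around existing code
    "refactor": 16,         # structural changes: verify in smaller chunks
}

def pick_window(edit_kind: str, default: int = 32) -> int:
    return SPECULATION_WINDOW.get(edit_kind, default)

# e.g. plugged into the speculative_apply() sketch from earlier:
# speculative_apply(model, prompt_ids, draft_ids, window=pick_window("type_hints"))
```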

Near Symmetric Inference

  • This task is nearly symmetric: it isn't inherently prefill-leaning (though it is prefill-heavy), and the input and output are usually similar in length.
  • This matters because it lets us use many of the optimizations that work well for near-symmetric tasks.

Performance on huge files

  • We've tested up to 2000 lines with consistent performance.
  • We're working on pushing this even further. The main limit here is cost of training and inference.