
Morph: Enable Faster Coding Agents

A deep dive into how we make Morph so fast.

Posted by Tejas Bhakta

6 minute read


Why speed matters

Morph is a specialized LLM trained to be world class at applying changes to code and files. We've been working on an approach similar to Cursor's Composer and Apply, but with a focus on speed and accuracy: a dedicated model whose only job is to apply the changes.

We've achieved:

  • 1000 tokens/s processing speed
  • Consistent performance across file sizes up to 1500 lines

From an ML perspective, this is not an unsolved problem. GPT-4o and Claude Sonnet are both capable of applying edits with the right prompt and sufficient yelling.

The hard part is that we need to do this fast, cheap, and at scale. Using frontier models to apply changes to files is not practical from a cost and latency perspective. Coding agents don't have that good product "feel" without this.

GPU              Memory (GB)   Memory bandwidth (GB/s)   FP16 Tensor Core (TFLOP/s)
A10              24            600                       125
A100 PCIe/SXM    80            1935/2039                 312
L4               24            300                       121
L40S             48            864                       362
H100 PCIe/SXM    80            2000/3350                 756/989
H200 SXM         141           4800                      989
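
A back-of-the-envelope way to read the table above: single-stream decode is bound by memory bandwidth, not FLOPs, because every generated token has to stream all of the model's weights from HBM once. The numbers below assume a hypothetical 7B-parameter model served in FP16; they're illustrative, not a statement about our actual model.

```python
# Rough decode-throughput ceiling at batch size 1, assuming a hypothetical
# 7B-parameter model in FP16 (2 bytes per parameter). Illustrative only.
weights_gb = 7e9 * 2 / 1e9                      # 7B params x 2 bytes = 14 GB of weights
bandwidth_gbps = {"A10": 600, "L40S": 864, "A100 SXM": 2039, "H100 SXM": 3350, "H200 SXM": 4800}
for gpu, gbps in bandwidth_gbps.items():
    ceiling = gbps / weights_gb                 # tokens/s upper bound at batch size 1
    print(f"{gpu}: ~{ceiling:.0f} tokens/s ceiling at batch size 1")
```

Batching is what lets us climb above that single-stream ceiling, which is why so much of the work below is about scheduling.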

Performance Improvements

Our approach leads to significant speedups:

  • 5-8x faster than traditional inference
  • Near-instant results

There are 3 main things we can leverage:

  • The task is very narrow and we can optimize for that
  • The input and output are very similar in content, so we can use speculation to speed up inference
  • The input and output lengths are often similar

The Problem

The naive approach to building an LLM code modification system is straightforward but inefficient:

  1. Load model into GPU memory
  2. For each request: tokenize input, run inference, generate tokens, detokenize output
  3. Return modified code
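
Concretely, the naive loop looks something like the sketch below. A generic Hugging Face-style stack stands in here; the model name and prompt format are placeholders, not our actual serving code.

```python
# Naive "apply" service: one request at a time, one full generate() per call.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "your-code-model"                        # placeholder checkpoint, not Morph's
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).to("cuda")

def apply_edit(original_code: str, edit: str) -> str:
    prompt = f"<original>{original_code}</original>\n<edit>{edit}</edit>\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # The GPU is fully reserved for this one request until generation finishes.
    output = model.generate(**inputs, max_new_tokens=4096)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```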

While this works for demos, it hits major bottlenecks:

  • Poor GPU utilization with idle time between requests
  • Sequential processing creates queues
  • Fixed batch size misses parallelization opportunities
  • Memory inefficiency from reserving full GPU per request

This typically achieves only 300-400 tokens/second at useful model sizes on an A100/H100, and throughput scales only linearly with cost, making it impractical for a production service. We need something much faster.

Optimization Approaches

We've optimized the inference pipeline for speed. This is by far the hardest part of making Morph work well. To have a free tier, we need to be able to process requests in a way that feels instant and doesn't have annoying batching latency.

Batching

  • Why batch? Batching lets us process multiple LLM inference requests in parallel, and it's a core way to maximize token throughput.
  • Batching comes at the cost of latency: early arrivals wait for the batch to fill (see the sketch below).
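
Here's a stripped-down static-batching loop to make the trade-off concrete. It reuses the `tokenizer` and `model` from the naive sketch above; the batch size and wait time are illustrative.

```python
# Static batching sketch: collect up to `batch_size` requests, then run them in one
# generate() call. Decode steps are shared across the batch (higher throughput),
# but the first request waits for the batch to fill or the timer to expire (latency).
import queue

requests: "queue.Queue[str]" = queue.Queue()

def serve_batches(batch_size: int = 8, max_wait_s: float = 0.05):
    tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
    tokenizer.padding_side = "left"               # decoder-only models pad on the left
    while True:
        batch = [requests.get()]                  # block until at least one request exists
        while len(batch) < batch_size:
            try:
                batch.append(requests.get(timeout=max_wait_s))
            except queue.Empty:
                break                             # stop waiting, run a partial batch
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")
        outputs = model.generate(**inputs, max_new_tokens=4096)
        # ...decode each row of `outputs` and route it back to the right caller
```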

Optimizations

Continuous Batching with Dynamic SLAs

We've implemented an intelligent batching system that adapts to real-time request patterns. Rather than using fixed batch sizes, our system dynamically adjusts based on incoming traffic and client priorities. This means enterprise users get immediate processing while free tier requests are smartly batched for optimal throughput.
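
A stripped-down version of the scheduling idea is below. The priority and SLA rules are illustrative, not our production policy: interactive/enterprise requests are admitted at the next decode step, free-tier requests wait until there's spare batch capacity or their deadline arrives.

```python
# Continuous batching sketch: requests join the running batch at decode-step
# boundaries instead of waiting for the current batch to finish.
import heapq
import time
from dataclasses import dataclass, field

MAX_BATCH = 32

@dataclass(order=True)
class Request:
    priority: int                        # 0 = interactive/enterprise, 1 = free tier
    deadline: float                      # latest acceptable start time (the SLA)
    prompt: str = field(compare=False)

pending: list[Request] = []              # heap ordered by (priority, deadline)

def admit(req: Request) -> None:
    heapq.heappush(pending, req)

def schedule_step(running: list[Request]) -> list[Request]:
    """Called once per decode step: top up the running batch from the pending heap."""
    now = time.monotonic()
    while pending and len(running) < MAX_BATCH:
        nxt = pending[0]
        is_urgent = nxt.priority == 0 or nxt.deadline <= now
        has_slack = len(running) < MAX_BATCH // 2
        if is_urgent or has_slack:
            running.append(heapq.heappop(pending))
        else:
            break                        # free-tier request can afford to wait
    return running
```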

Speculative Decoding

We use the original code as the speculation for the decoder. Since most edits preserve significant portions of the input, this gives us a speedup by avoiding redundant computation: the model can quickly verify unchanged sections and focus compute on the parts that need changes.
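
In sketch form it looks like the function below, assuming a Hugging Face-style causal LM, batch size 1, and greedy acceptance; the naive "resume one token later" re-alignment after a mismatch is a simplification of what a real system does.

```python
# Speculative-edit sketch: the original file itself is the draft. Each iteration runs
# one forward pass over a window of draft tokens, keeps the prefix the model agrees
# with, and falls back to the model's own token at the first disagreement.
import torch

@torch.no_grad()
def speculative_apply(model, prompt_ids: torch.Tensor, draft_ids: torch.Tensor,
                      window: int = 32) -> torch.Tensor:
    out = prompt_ids                                     # shape (1, n): prompt + accepted output
    i = 0
    while i < draft_ids.shape[1]:
        chunk = draft_ids[:, i:i + window]               # next slice of the original file
        logits = model(torch.cat([out, chunk], dim=1)).logits
        start = out.shape[1] - 1                         # logits[start] predicts chunk[0]
        preds = logits[:, start:start + chunk.shape[1]].argmax(-1)
        agree = (preds == chunk).long().cumprod(dim=1).sum().item()
        out = torch.cat([out, chunk[:, :agree]], dim=1)  # accept the agreeing prefix for free
        if agree < chunk.shape[1]:
            out = torch.cat([out, preds[:, agree:agree + 1]], dim=1)  # model's correction
            i += agree + 1                               # naive re-alignment with the draft
        else:
            i += agree
    return out
```

When the edit leaves most of the file untouched, almost every window is accepted in full, so we pay roughly one forward pass per `window` tokens instead of one per token.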

Quantization-Aware KV Cache

Memory management is crucial for fast, scalable inference. Our KV cache system implements advanced quantization techniques:

KV Cache Quantization Format Analysis

We've extensively tested two key formats:

  • E5M2 (5 exponent, 2 mantissa bits)

    • Dynamic range: ±57344.0
    • Better suited for attention values with large dynamic ranges
    • Lower precision but requires less scaling overhead
    • Direct compatibility with existing CUDA kernels
  • E4M3FN (4 exponent, 3 mantissa bits)

    • Dynamic range: ±448.0
    • Higher precision for small/medium values
    • Requires FP32 scaling factors per tensor
    • Better numerical stability for attention patterns
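
To make the format trade-off concrete, here's a minimal round-trip using PyTorch's native float8 dtypes (PyTorch 2.1+). The per-tensor scaling policy is illustrative, not our production scheme.

```python
# Comparing the two FP8 formats for KV-cache storage.
import torch

print(torch.finfo(torch.float8_e5m2).max)      # 57344.0: wide range, 2 mantissa bits
print(torch.finfo(torch.float8_e4m3fn).max)    # 448.0:   narrower range, 3 mantissa bits

def quantize_kv(kv: torch.Tensor, dtype: torch.dtype = torch.float8_e4m3fn):
    """Round-trip a KV tensor through FP8, keeping one FP32 per-tensor scale."""
    scale = (kv.abs().amax().float() / torch.finfo(dtype).max).clamp(min=1e-12)
    q = (kv.float() / scale).to(dtype)          # stored as FP8 values + one FP32 scale
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) * scale).to(torch.float16)

kv = torch.randn(8, 1024, 64, dtype=torch.float16) * 4      # fake per-head KV slice
q, scale = quantize_kv(kv)
print((dequantize_kv(q, scale) - kv).abs().max())           # worst-case quantization error
```
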
Implementation Details

  • Fused dequantization + attention operations are required for this to be useful at minimal latency
  • Custom CUDA kernels optimized for the E5M2 format (in progress)
  • Adaptive scaling based on attention head statistics
  • Intelligent cache eviction using access patterns and priority queues

Performance Impact

  • 4x memory reduction vs FP16
  • Negligible quality degradation on our test suite
  • 2.3x reduced memory bandwidth usage
  • Negligible latency overhead from quantization/dequantization

There's a bit more to making this less naive that we won't go into here. Contact us if you're interested in the details. We're currently developing CUDA kernels that fuse dequantization directly into attention computation, eliminating separate dequantization passes. Early benchmarks show an additional 15-20% speedup.

Custom CUDA Kernels (In Progress)

We're working on specialized GPU operations optimized specifically for code editing. Code editing is an extremely narrow task, and we can leverage that to squeeze out performance.

  • Fused attention operations that minimize memory transfers
  • Batch-aware implementations that maximize GPU utilization
  • Custom variants of FlashAttention tuned for our use case

A wise man once said: OK, CUDA is hard; good CUDA is harder. We're working on this.

How do you build a fast coding agent on this?

  • Apply changes in the background, before the user clicks apply; batched requests cost less for business users (sketched below).
  • When a user wants to apply changes directly, process them immediately.
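
A minimal sketch of that flow. The `request_apply(...)` call and its `priority` values are hypothetical stand-ins for whatever apply API you use; they are not a description of our client library.

```python
# Fire a cheap, batch-priority apply as soon as the model proposes an edit, and only
# pay for the low-latency path if the user clicks before the background result lands.
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor()

def request_apply(original_code: str, edit: str, priority: str) -> str:
    """Placeholder for your apply call (e.g. an HTTP request to an apply endpoint)."""
    raise NotImplementedError

def on_edit_proposed(original_code: str, edit: str) -> Future:
    # Optimistic background apply: batched, cheaper, nobody is waiting on it yet.
    return executor.submit(request_apply, original_code, edit, priority="batch")

def on_user_clicks_apply(background: Future, original_code: str, edit: str) -> str:
    if background.done():
        return background.result()                # already merged: feels instant
    background.cancel()                           # no-op if the batched request already started
    return request_apply(original_code, edit, priority="interactive")
```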

Future Optimizations

We're working on several improvements to make Morph even faster and more accurate:

Adaptive Speculation

Our current speculative edits approach uses a fixed strategy, but we're developing a system that dynamically adjusts speculation based on the type of transformation being performed. For example:

  • Type hint additions can use aggressive speculation since the changes are usually localized
  • Refactoring operations may need more conservative speculation due to their complexity
  • Small formatting changes can use maximum speculation for fastest possible throughput
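
The dispatch itself can be as simple as a lookup from edit type to speculation window. The categories and window sizes below are illustrative, not tuned production values.

```python
# Adaptive speculation sketch: choose how many original-file tokens to draft per
# verification pass based on the kind of transformation being applied.
SPECULATION_WINDOW = {
    "formatting": 128,      # output nearly identical to input: speculate aggressively
    "type_hints": 64,       # localized insertions around existing code
    "refactor": 16,         # structural changes: verify in smaller chunks
}

def pick_window(edit_kind: str, default: int = 32) -> int:
    return SPECULATION_WINDOW.get(edit_kind, default)

# e.g. plugged into the speculative_apply() sketch from earlier:
# speculative_apply(model, prompt_ids, draft_ids, window=pick_window("type_hints"))
```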

Near Symmetric Inference

  • This task is nearly symmetric: it isn't inherently prefill-leaning (though it is prefill-heavy), and the input and output are usually similar in length.
  • This matters because it lets us use many of the optimizations that work well for near-symmetric tasks.

Performance on huge files

  • We've tested up to 2000 lines with consistent performance.
  • We're working on pushing this even further. The main limit here is cost of training and inference.