Fast Apply Makes Faster Agents

How Morph Fast Apply is our first step toward the sub-agent future: small, specialized models that escape the valley of death.

Dhruv Bhatia
Dat Quoc
Tejas Bhakta
November 4, 2025 · 6 min read

Cognition's Semi-Async Valley of Death

Image credit: @swyx from Cognition

Small models are the future of agentic AI.

Cognition illustrated this idea really well with their "Semi-Async Valley of Death." We agree with their framing, and we've seen the same thing in practice.

We've worked with almost all of the top vibe-coding platforms, and conversion boils down to two things:

  • Accuracy - did it do what the user wanted (not just what they said)?
  • Speed - there's a floor to speed: users who will wait 180 seconds will also wait 10 minutes. But among cohorts that don't run into errors, users who truly care about speed convert at roughly double the rate when speeds double.

In order to avoid "the valley of death", work either needs to finish in a few seconds to preserve flow state, or it needs to run autonomously over hours. The middle zone creates friction instead of productivity for humans, since the probability of breaking flow increases by 10% for each additional second of waiting. A good sub-agent is one that makes the model faster, more accurate, and preserves the model's context window.
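To put that 10%-per-second figure in perspective, here is a simple compounding calculation on the number stated above (an illustration of the claim, not new measurements):

```python
# Compounding the stated 10%-per-second chance of breaking flow:
# the probability a user is still in flow after t seconds of waiting.
def flow_survival(seconds: float, break_prob_per_second: float = 0.10) -> float:
    return (1.0 - break_prob_per_second) ** seconds

print(round(flow_survival(3), 2))   # ~0.73 -- a few seconds is mostly safe
print(round(flow_survival(10), 2))  # ~0.35
print(round(flow_survival(30), 2))  # ~0.04 -- the middle zone almost always breaks flow
```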

For agentic code systems, this principle forces a design choice:

  • Fast, specialized models for interactive loops.
  • Larger, general models for long-horizon autonomy.

The first model we built to make agents faster is Fast Apply.

Fast Apply: Option 3

The most popular form of an agent is a coding agent. While large models are great at thinking about what code needs to be written, merging that code into a file is a messy and failure-prone process.

The agent has two options: rewrite the entire file, or use a search-and-replace tool.

Full-file rewrites waste tokens, since files often run to thousands of tokens. Search-and-replace tools have high failure rates when models need to execute parallel search-and-replace calls.

At Morph we've introduced option 3: a small model that your LLM can delegate edits to, producing faster and more accurate merges.
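Here is a minimal sketch of what option 3 looks like from the agent's side, assuming an OpenAI-compatible chat endpoint. The base URL, model name, and tag format below are illustrative placeholders, not the exact Morph API.

```python
# Sketch: the coding agent writes an abbreviated edit; a small apply model
# merges it into the original file. Endpoint, model name, and tag format
# are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

original_file = open("game.js").read()

# The large model only writes the changed lines and elides the rest.
edit_snippet = """
// ... existing code ...
const SKY_COLOR = "orange";
// ... existing code ...
"""

response = client.chat.completions.create(
    model="fast-apply-model",  # hypothetical model name
    messages=[{
        "role": "user",
        "content": f"<code>{original_file}</code>\n<update>{edit_snippet}</update>",
    }],
)

merged_file = response.choices[0].message.content  # the full, merged file
```

The large model spends tokens only on the lines that change; the apply model handles the mechanical merge.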

Apply Accuracy (What Actually Matters)

Latency alone doesn't make a fast-apply system useful; accuracy is what unlocks real speed. If an apply model fails, the LLM has to re-think, re-generate, and re-issue the edit. That retry loop destroys latency wins and breaks flow. Accuracy is speed.

This is where most "fast" apply approaches fall apart: when edits fail, the LLM must try again, and practical latency spikes.

A real fast-apply model must:

  • Understand and modify code reliably
  • Execute edits correctly on the first try
  • Minimize token overhead for describing changes

The correct benchmark isn't single-model speed. It's end-to-end task completion time.

That includes retries, model output length, and recovery from failures. That's why Morph Fast Apply outperforms: higher accuracy → fewer retries → consistently lower total latency.

Users don't care about tokens per second. They care about how long it takes to get what they asked for, which is why wall-clock time is the relevant metric. When traditional search and replace (as in Claude Code) or a fast-apply call fails, the model has to retry, and every retry increases P(breaking flow state).
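A toy retry model makes the point (our own simplification with made-up numbers, not benchmark results): if every failed apply forces the large model to re-think and re-issue the edit, expected end-to-end time scales with one over the success rate.

```python
# Toy retry model (illustrative numbers, not benchmark results): expected
# end-to-end edit time when every failed apply triggers a re-think + re-apply.
def expected_edit_time(apply_s: float, success_rate: float, rethink_s: float) -> float:
    expected_attempts = 1.0 / success_rate
    return expected_attempts * apply_s + (expected_attempts - 1.0) * rethink_s

# A "fast" but unreliable apply path vs. a slightly slower, accurate one.
print(expected_edit_time(apply_s=2.0, success_rate=0.80, rethink_s=15.0))  # 6.25 s
print(expected_edit_time(apply_s=3.0, success_rate=0.98, rethink_s=15.0))  # ~3.37 s
```

The slower-per-call but more accurate path wins on wall-clock time, which is the comparison the charts below illustrate.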

End-to-end time comparison

Accuracy comparison

Benchmark Dataset

For our benchmarks, we created a dataset of 50 repositories. Some of these are open-source libraries or projects, while others are sample vibe-coded apps we took from real-world users. For each repository, we then created a set of 10-20 feature requests or bug fixes that a user might ask an LLM to make. These looked like:

  • Can you make the sky orange in this game?
  • I think the upload functionality isn't working. Can you fix it?
  • Can you add more logging?

Then, each request was sent to Claude in an environment that represented the average coding agent. Claude was responsible for finding and fixing the issue using the tool we specified, which might be search and replace, full-file edits, or a third-party apply model. We then ran test cases to verify bug fixes and used an LLM as a judge to verify feature implementations. We also manually reviewed randomly selected samples to make sure our evaluations were working correctly. The goal of the setup is to mimic an agentic coding environment as closely as possible.
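In outline, the harness looks something like the sketch below. The helper names (run_agent, run_tests, llm_judge) are illustrative stubs standing in for the real components, not our internal code.

```python
# Sketch of the benchmark loop described above; helpers are stubs.
from dataclasses import dataclass

@dataclass
class Task:
    id: str
    prompt: str   # e.g. "Can you make the sky orange in this game?"
    kind: str     # "bug_fix" or "feature"

def run_agent(repo: str, task: Task, edit_tool: str) -> str:
    """Stub: run Claude in an agent loop restricted to one edit tool;
    return the path of the modified workspace."""
    raise NotImplementedError

def run_tests(repo: str, workspace: str) -> bool:
    """Stub: deterministic test cases for bug-fix tasks."""
    raise NotImplementedError

def llm_judge(task: Task, workspace: str) -> bool:
    """Stub: LLM-as-judge check for feature requests."""
    raise NotImplementedError

def evaluate(repos: dict[str, list[Task]], edit_tool: str) -> list[dict]:
    results = []
    for repo, tasks in repos.items():
        for task in tasks:
            workspace = run_agent(repo, task, edit_tool)
            passed = (run_tests(repo, workspace) if task.kind == "bug_fix"
                      else llm_judge(task, workspace))
            results.append({"repo": repo, "task": task.id, "passed": passed})
    return results
```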

The Infra That Got Us to 10,500 tokens/sec

We didn't just train a slightly smaller model. To get here, we engineered a new execution path for code-editing agents:

  1. Custom CUDA kernels fusing attention and feed-forward ops, eliminating redundant memory traffic.
  2. A custom speculative decoding pipeline that drafts against the original file, delivering 5× practical speed-ups (sketched after this list).
  3. A purpose-built model architecture for merging code, with pruned vocabulary and hierarchical positional encodings for AST-like structure.
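To make item 2 concrete, here is a highly simplified sketch of the idea, not our actual pipeline: because the merged output mostly copies the original file, the file itself can act as the draft, and the apply model only has to verify. The verify() call stands in for a single batched forward pass, and the suffix-matching heuristic is deliberately naive.

```python
# Simplified sketch of speculative decoding where the *original file*
# supplies the draft tokens. verify() is a stand-in for one batched
# forward pass of the apply model.
def draft_from_source(source_tokens, output_tokens, k):
    """Propose the next k tokens by continuing the longest match of the
    current output suffix found inside the original file."""
    for n in range(min(len(output_tokens), 32), 0, -1):
        suffix = output_tokens[-n:]
        for i in range(len(source_tokens) - n):
            if source_tokens[i:i + n] == suffix:
                return source_tokens[i + n:i + n + k]
    return []

def speculative_decode(source_tokens, verify, k=8, max_len=4096):
    output = []
    while len(output) < max_len:
        draft = draft_from_source(source_tokens, output, k)
        # verify() returns the prefix of the draft the model agrees with,
        # plus one model-sampled token (None at end of sequence). Unchanged
        # regions accept long drafts; edited regions fall back to normal decoding.
        accepted, next_token = verify(output, draft)
        output.extend(accepted)
        if next_token is None:
            break
        output.append(next_token)
    return output
```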

Purpose-built beats general-purpose when tasks are scoped. Agent systems are being architected around the minimum compute to complete a scoped task to high fidelity.

The difference comes down to data quality

The open-source Fast Apply dataset, now used across the ecosystem, was originally built by Morph engineers.

At the time, we used a straightforward approach: prompt LLMs to generate synthetic training data for code application. The problem? Those synthetic examples don't reflect what real coding agents actually produce. When an agent is making edits while dealing with incomplete context, ambiguous instructions, or complex multi-file changes, its outputs look very different from clean, prompted examples.

At Morph, we built morph-fast-apply with a far more rigorous pipeline. We distill from multiple state-of-the-art models and use multi-stage verification with both LLM-based and programmatic checks for the reward signal. But the real work is in the details: we spend our time probing edge cases, constantly testing how the model should behave in tricky scenarios. Is this a minor syntax preference or a meaningful style choice? Is this omission a model error or an intentional deletion? As context grows, agents sometimes forget to include markers for unchanged code. These distinctions matter, and while the space of edge cases is finite, covering them thoroughly requires obsessive attention.
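To make the marker problem concrete, here is an illustrative edit snippet (invented for this post, not real training data):

```python
# Illustrative edit snippet. The agent changed one function but, deep into
# a long context, dropped the "existing code" marker after it -- the apply
# model has to decide whether the code below was deliberately deleted or
# simply elided.

# ... existing code ...
def upload_file(path, client):
    # Fix: stream the file instead of reading it fully into memory
    with open(path, "rb") as f:
        return client.upload(f, chunked=True)

# (no trailing marker here: deletion, or a forgotten "# ... existing code ..."?)
```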

Most importantly, our training data comes from real agent outputs in production, not synthetic "please make an edit" prompts. We continuously retrain every week, incorporating failed edge cases back into the pipeline. This closed loop between real-world usage and model improvement is what makes code application work at production scale.

May the fastest systems win

Human intelligence organizes itself in tiers, with the smartest among us at the top delegating work to those with less experience. Models will organize in a similar fashion: the largest and heaviest models sit at the very top and delegate work to sub-agents that get increasingly smaller and faster. Morph is building this sub-agent future, and Fast Apply is only the beginning.

Our goal is to prevent you from seeing a message like this ever again:

Jump Scare