SWE-bench Pro
[Chart: SWE-bench Pro Improvement. Leaderboard view; green bars use Warp Grep, outlined bars are baseline.]

- 15% average cost reduction
- 19% average time reduction
- 28 turns saved on average
Detailed Performance Improvement
We ran the official SWE-bench Pro benchmark with MiniMax 2.5, once at baseline and once with Warp Grep.
Warp Grep can make an open-source model outperform frontier models.
Sweep: MiniMax 2.5
| Metric | Baseline | Warp Grep | Delta |
|---|---|---|---|
| Avg events/instance | 157 | 135 | -14% |
| Avg prompt tokens | 2,926,502 | 2,461,973 | -16% |
| Avg completion tokens | 17,190 | 15,222 | -11% |
| Avg reasoning tokens | 7,347 | 6,835 | -7% |
| Avg cost/instance | $0.18 | $0.15 | -17% |
| Total cost (18 inst) | $3.26 | $2.77 | -15% |
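The Delta column follows directly from the two averages in each row. A quick sketch of the arithmetic, using the table's own numbers (percentages rounded to the nearest whole percent):

```python
# Recompute the Delta column from the (baseline, Warp Grep) averages above.
rows = {
    "Avg events/instance": (157, 135),
    "Avg prompt tokens": (2_926_502, 2_461_973),
    "Avg completion tokens": (17_190, 15_222),
    "Avg reasoning tokens": (7_347, 6_835),
    "Avg cost/instance": (0.18, 0.15),
    "Total cost (18 inst)": (3.26, 2.77),
}
deltas = {m: round((warp - base) / base * 100) for m, (base, warp) in rows.items()}
for metric, pct in deltas.items():
    print(f"{metric}: {pct:+d}%")
```

Every recomputed value matches the table, so the reported deltas are consistent with the raw averages.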
Agent Capabilities Improvement
SWE-bench evaluation with Claude Opus 4.5, run with WarpGrep as the code-search tool and without it. Better search directly improves agent effectiveness.
| Metric | Without WarpGrep | With WarpGrep | Delta |
|---|---|---|---|
| Input tokens | 14K | 9K | 39% fewer |
| Agent turns | 35.0 | 26.0 | 26% fewer |
| Tasks solved | 74.4% | 81.9% | 10% more |
Build better coding agents
WarpGrep is available as an API and SDK component. Join 500+ teams using Morph.
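As a rough illustration of what wiring a code-search API into an agent might look like, here is a minimal request-builder sketch. The endpoint URL, field names, and response shape below are placeholders invented for this example; they are not the actual WarpGrep API, so consult the official SDK docs for the real interface.

```python
import json

def build_search_request(query: str, repo: str, api_key: str) -> dict:
    """Assemble the pieces an HTTP client would need for one hypothetical
    code-search call. All names here are illustrative assumptions."""
    return {
        "url": "https://api.example.com/v1/search",  # placeholder endpoint
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"query": query, "repo": repo}),
    }

req = build_search_request("where is the retry logic?", "acme/backend", "sk-demo")
print(req["url"])
```

An agent loop would send this request each time the model asks to search, and feed the returned snippets back as tool output instead of raw grep results.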