
Optimizing Models to Be Fast at Codegen
Three places the open inference stack quits, and what we build past each. A speculator trained on the model's own diffs: a generic draft gets 1.93x, a trained one 3.07x. An autoresearch loop for kernels on $7K GPUs: 97 to 162 tok/s. A prefix cache that crosses NVLink-denied boxes over plain TCP.





































