
LLMs Are Bad at Being Forced

Posted by Tejas Bhakta

6 minute read


Imagine you're a world-class pianist. You've spent years training your fingers to flow naturally across keys, developing muscle memory for complex pieces. Now someone asks you to play Chopin—but you must wear thick winter gloves and press each key with a specific finger sequence that feels completely unnatural.

You'd still be able to play the piece, technically. But it would sound worse. Much worse.

This is exactly what happens when we force large language models to output in rigid, unnatural formats. The bigger the gap between how a model wants to express itself and how we force it to respond, the more its performance degrades.

The Natural Flow of Language Models

Language models are fundamentally autoregressive systems. They predict the next token based on all previous tokens, one at a time, left to right. During training, they see billions of examples of natural human text—code, essays, conversations, documentation. They learn the patterns, rhythms, and structures that feel "right."
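
Concretely, generation is just a loop: score every token in the vocabulary, pick one, append it, and repeat. Here is a minimal sketch of that loop using the Hugging Face transformers library; the model name "gpt2" and greedy selection are purely illustrative choices, not anything specific to the models discussed in this post:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Start from a prompt and repeatedly predict the single next token.
input_ids = tokenizer("def calculate_fibonacci(n):", return_tensors="pt").input_ids
for _ in range(20):
    logits = model(input_ids).logits            # a score for every vocabulary token
    next_id = logits[0, -1].argmax()            # greedy: take the most likely one
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))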

When Claude or GPT-4 generates code, it wants to write something like this:

def calculate_fibonacci(n):
    if n <= 1:
        return n
    return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)

This flows naturally. The indentation follows human patterns. Variable names read like English. The structure matches what the model has seen millions of times in training.

What Happens When We Force Structure

But modern AI applications rarely let models output naturally. Instead, we force them into rigid schemas through "constrained decoding"—a process that restricts which tokens the model can generate at each step.
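
Mechanically, constrained decoding usually works by masking the model's next-token scores so that only schema-legal tokens can ever be selected. A toy sketch of that idea, where allowed_ids is a hypothetical list that a grammar or JSON-schema engine would supply at each step:

import torch

def constrained_next_token(logits: torch.Tensor, allowed_ids: list[int]) -> int:
    # Every token outside the schema gets -inf, so it can never be chosen.
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0
    return int((logits + mask).argmax())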

Tool calling is the most common example. Instead of letting the model write natural text, we force it to output JSON:

{
  "function_name": "calculate_fibonacci",
  "parameters": {
    "n": 10
  },
  "code": "def calculate_fibonacci(n):\n    if n <= 1:\n        return n\n    return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)"
}

Notice what happened to the code. Those natural newlines became \n escape sequences, and any quotes inside the code would have to become \" escapes. The model isn't just writing code anymore; it's writing code as a JSON string, which means constantly thinking about escaping and formatting on top of the actual task.
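
You can see the extra burden by serializing the same function yourself with Python's standard json module; a small sketch:

import json

code = """def calculate_fibonacci(n):
    if n <= 1:
        return n
    return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)"""

# Every newline becomes \n (and any quote would become \") in the serialized
# string. A model emitting JSON has to produce those escapes token by token.
print(json.dumps({"code": code}))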

The Aider Discovery

Recent research from aider.chat quantified this effect beautifully. They tested four top models (Claude 3.5 Sonnet, DeepSeek Coder V2, GPT-4o variants) on 133 coding problems, comparing natural markdown output versus JSON-wrapped code.

The results were striking. Every single model performed worse when forced to wrap code in JSON. Some degraded dramatically—Claude 3.5 Sonnet's performance dropped significantly, despite being perfectly capable of generating valid JSON.

Even more telling: models made more syntax errors in the code itself when asked to JSON-wrap it. They weren't failing at JSON generation—they were writing worse code because the cognitive overhead of JSON formatting interfered with the actual coding task.

Beyond JSON: The Spectrum of Constraint

The JSON case is just one point on a spectrum. The more we force models away from natural expression, the worse they perform.

Consider these increasingly unnatural output formats, ordered by how far they deviate from training data:

  1. Natural text: "Here's the corrected function..." (baseline performance)
  2. Markdown code blocks: "```python\n..." (minimal constraint)
  3. Custom search/replace syntax (moderate constraint; a sketch of applying one of these blocks follows this list):
    <<<<<<< SEARCH
    old_code()
    =======
    new_code()
    >>>>>>> REPLACE
  4. JSON-wrapped: {"code": "def foo():\n..."} (moderate constraint)
  5. Binary encoding: asking for code as an executable binary (absurd constraint)
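
To make the middle of that spectrum concrete, here is a minimal sketch of how a tool might apply one search/replace block of the kind shown in item 3. The helper apply_search_replace is hypothetical, and real implementations handle whitespace drift and repeated occurrences far more carefully:

def apply_search_replace(source: str, block: str) -> str:
    # Split the block into its SEARCH and REPLACE halves.
    _, _, rest = block.partition("<<<<<<< SEARCH\n")
    search, _, rest = rest.partition("\n=======\n")
    replace, _, _ = rest.partition("\n>>>>>>> REPLACE")
    if search not in source:
        raise ValueError("search text not found in source")
    # Patch only the first occurrence, like a targeted edit.
    return source.replace(search, replace, 1)

edit = "<<<<<<< SEARCH\nold_code()\n=======\nnew_code()\n>>>>>>> REPLACE"
print(apply_search_replace("result = old_code()\n", edit))   # result = new_code()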

Each step further from natural language creates more cognitive overhead. The model spends processing power on format compliance instead of problem solving.

Why This Matters for Autoregressive Systems

This isn't a bug—it's a fundamental feature of how these systems work. Language models are trained to predict the next token based on probability distributions learned from human text. When we constrain the output space, we're essentially telling the model: "Generate tokens that fit this artificial schema instead."

The autoregressive nature makes this worse. Each constrained token choice affects all subsequent predictions. If the model struggles with JSON escaping early in the response, that confusion compounds throughout the entire generation.

Think of it like speaking a foreign language while doing math. You can do both tasks independently, but combining them creates interference. The more complex the constraints, the more interference.

The Engineering Implications

This has profound implications for AI engineering. The industry has largely moved toward structured outputs because they're easier to parse programmatically. JSON responses are cleaner than parsing freeform text. Tool calling feels more reliable than prompt engineering.

But we're trading ease of integration for quality of output. Every layer of constraint we add makes the model slightly worse at the task we actually care about.

The most effective AI systems will be those that minimize the delta between natural model behavior and required output format. Instead of forcing models into rigid schemas, we should design systems that work with how models naturally want to express themselves.

This doesn't mean abandoning structure entirely. It means being thoughtful about when structure is worth the performance cost, and designing constraints that feel as natural as possible given how these systems are trained.

The Future of Human-AI Interaction

As we build more sophisticated AI systems, this principle becomes increasingly important. The models that seem "smartest" aren't necessarily the ones with the most parameters—they're the ones operating closest to their natural mode of expression.

The best AI interfaces will feel conversational, not transactional. They'll let models express uncertainty, reasoning, and creativity in natural ways, then extract the structured data we need from that rich output.
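
In practice, that extraction step can be as simple as letting the model answer in plain markdown and pulling the code out afterwards. A minimal sketch, where extract_code_block is a hypothetical helper:

import re

def extract_code_block(reply: str) -> str | None:
    # Grab the contents of the first fenced code block, if the reply has one.
    match = re.search(r"```(?:\w+)?\n(.*?)```", reply, re.DOTALL)
    return match.group(1) if match else None

reply = "Here's the corrected function:\n```python\ndef foo():\n    return 42\n```"
print(extract_code_block(reply))   # prints the bare function, ready to apply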

Because ultimately, LLMs aren't databases or APIs. They're language systems trained on human communication. The closer we let them stay to that natural mode, the better they'll serve us.

Building Infrastructure Around Natural Model Behavior

At Morph, we agree with Cursor's engineering philosophy: build your infrastructure around how autoregressive models naturally want to generate, rather than forcing them to constrain their output.

This insight drove us to create Morph Fast Apply—a system designed to work with natural model behavior instead of against it. Rather than forcing Claude or Gemini to output rigid JSON schemas or custom formats, we let them generate code changes in their preferred style, then handle the complex merging process ourselves.

The result? We can apply LLM-generated edits at 2000+ tokens per second—faster, better, and cheaper than traditional constraint-based approaches. By removing the cognitive overhead of format compliance from the language model, we get higher quality code changes while dramatically improving speed.

This isn't just a performance optimization—it's a fundamental architectural choice. When you design systems that work with how models naturally operate, you unlock their full potential instead of fighting against their training.


Want to dive deeper into how output constraints affect model performance? The aider.chat team has published extensive benchmarks on this topic, with quantitative results across multiple state-of-the-art models.