
Learning to Test Code Changes with RL

Traditional testing tells you if code runs. But how do you verify that UI changes actually work—that buttons are clickable, forms submit, and components render correctly after a diff?

We trained a browser agent to explore and verify code changes using a simple reward policy: reward the model for bringing modified components into view, and reward engagement with those components. The result is an agent that learns to systematically test the parts of your UI that actually changed.

This post covers our approach to reward shaping for UI verification, how we use React Fiber to identify changed components, and the surprisingly simple policy that makes it work.

The Diff Detection Problem

Consider a typical pull request: three files modified, touching a header component, a form field, and a submit button. Unit tests pass. Types check. But does the UI actually work?

The challenge is grounding—connecting code changes to their visual manifestation in the browser. A diff in Button.tsx doesn't tell you where that button renders, or how to interact with it.

We solved this using React's Fiber tree. Every React application maintains an internal representation of its component hierarchy. By walking this tree, we can map each changed file to its rendered DOM elements—complete with bounding boxes and interaction handlers.

// Fiber node contains everything we need
FiberNode {
  type: Button,              // Component that changed
  stateNode: HTMLElement,    // Rendered DOM element
  memoizedProps: { ... },    // Current props
  _debugSource: {            // Source file mapping
    fileName: "Button.tsx",
    lineNumber: 42
  }
}

Tools like Bippy from Aiden Bai and react-grab made this introspection practical. We can now answer: "Given this diff, which pixels on screen need testing?"

From Diffs to Bounding Boxes

The mapping process runs in three stages. First, we parse the git diff to extract changed files and line ranges. Second, we traverse the Fiber tree to find components whose source locations intersect with the diff. Third, we compute bounding boxes for each matched component.
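
As a rough illustration, here is a minimal Python sketch of the first two stages, assuming an injected script has already reported each component's _debugSource location; the record format and helper names are hypothetical, and the bounding-box stage runs in the page itself.

import re
from collections import defaultdict

HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")

def changed_line_ranges(diff_text: str) -> dict[str, list[range]]:
    """Stage 1: map each changed file to the line ranges touched by the diff."""
    ranges = defaultdict(list)
    current_file = None
    for line in diff_text.splitlines():
        if line.startswith("+++ b/"):
            current_file = line[len("+++ b/"):]
        elif current_file and (m := HUNK_RE.match(line)):
            start = int(m.group(1))
            length = int(m.group(2) or 1)
            ranges[current_file].append(range(start, start + length))
    return dict(ranges)

def match_components(ranges, fiber_sources):
    """Stage 2: keep components whose source location falls inside a changed range.

    fiber_sources is assumed to be a list of {"fileName", "lineNumber", "component"}
    records collected in the browser, with fileName normalized to a repo-relative path.
    """
    matched = []
    for src in fiber_sources:
        for r in ranges.get(src["fileName"], []):
            if src["lineNumber"] in r:
                matched.append(src)
                break
    return matched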

interface ChangedComponent {
  file: string;           // "src/components/Button.tsx"
  component: string;      // "Button"
  boundingBox: DOMRect;   // { x, y, width, height }
  interactable: boolean;  // Has onClick, onSubmit, etc.
  coverage: number;       // 0-1, how much is in viewport
}

This gives us a structured representation of what needs testing. A three-file diff might produce seven changed components, each with a known location and interaction surface.

The coverage field is critical—it tracks what percentage of each component is currently visible in the viewport. This becomes a key signal for our reward function.
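
Coverage itself is just the visible fraction of the component's bounding box. A minimal sketch, assuming boxes and the viewport are expressed in the same page coordinates (the Rect type here is illustrative):

from dataclasses import dataclass

@dataclass
class Rect:
    x: float
    y: float
    width: float
    height: float

def viewport_coverage(box: Rect, viewport: Rect) -> float:
    """Fraction of the component's area currently inside the viewport, in [0, 1]."""
    overlap_w = max(0.0, min(box.x + box.width, viewport.x + viewport.width) - max(box.x, viewport.x))
    overlap_h = max(0.0, min(box.y + box.height, viewport.y + viewport.height) - max(box.y, viewport.y))
    area = box.width * box.height
    return (overlap_w * overlap_h) / area if area > 0 else 0.0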

The Reward Policy

The core of our reward function combines a coverage term and an engagement term. The intuition is simple—first get the changed component on screen, then interact with it.

Note: This formulation represents a significant portion of our reward signal, but not the complete picture. Additional terms for novelty, coherence, and task completion shape the full policy.

R(s, a) = \underbrace{\sum_{c \in \mathcal{C}} \left( \Delta \text{cov}(c) + \mathbb{1}[\text{cov}(c) > 0.9] \cdot 0.5 \right)}_{\text{coverage reward}} + \underbrace{\mathbb{1}[a \in \mathcal{I}] \cdot \left( 2 + \mathbb{1}[\Delta s] \right)}_{\text{engagement reward}} - \epsilon

where:

  • \mathcal{C} = set of changed components from the diff
  • \Delta \text{cov}(c) = change in viewport coverage for component c
  • \mathcal{I} = set of interactions targeting changed components
  • \Delta s = indicator that the application state changed after the action
  • \epsilon = 0.01, a small per-step time penalty for efficiency

The coverage term encourages exploration—scrolling, navigating, expanding collapsed sections. The engagement term rewards interaction—clicking buttons, filling forms, triggering handlers.

We found the 2:1 ratio (engagement:coverage) works well. Too much weight on coverage and the agent just scrolls past everything. Too much on engagement and it clicks randomly without finding changes first.
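
A per-step implementation of these two terms might look like the following. This is a sketch that omits the novelty, coherence, and completion terms mentioned above and assumes per-component coverage values are tracked between steps; the constants mirror the formula, not our full tuned configuration.

TIME_PENALTY = 0.01  # epsilon in the formula above

def step_reward(prev_cov, cov, targets_changed_component, state_changed):
    """Coverage + engagement reward for a single step.

    prev_cov, cov: dicts mapping changed-component id -> viewport coverage in [0, 1]
    targets_changed_component: the action interacted with a changed component
    state_changed: the application state changed as a result of the action
    """
    reward = -TIME_PENALTY
    for c, c_cov in cov.items():
        reward += c_cov - prev_cov.get(c, 0.0)           # delta coverage
        if c_cov > 0.9:
            reward += 0.5                                # near-full visibility bonus
    if targets_changed_component:
        reward += 2.0 + (1.0 if state_changed else 0.0)  # engagement reward
    return reward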

Beyond Code Coverage

Code coverage is one small part of testing code changes. A line being executed tells you nothing about whether the user experience is correct. You can have 100% line coverage and still ship a form that doesn't submit, a button that's invisible, or a modal that traps focus.

The deeper problem is that traditional testing frameworks optimize for the happy path. They verify that correct inputs produce correct outputs. But real users don't follow happy paths—they make mistakes, change their minds, and interact with your UI in ways you didn't anticipate.

What code coverage misses:

  • Validation states that only appear on invalid input
  • Error boundaries that only trigger on unexpected data
  • Loading states that flash too quickly to see
  • Race conditions between user actions

To truly verify a code change, you need an agent that explores the space of possible interactions—not just the intended ones.

Learning to Fail

Here's the counterintuitive part: to test UI changes effectively, the agent needs to act incorrectly. Partially filling out a form and hitting submit. Clicking a button twice in rapid succession. Navigating away mid-action. These aren't edge cases—they're the first things real users do.

This is the opposite of what frontier models are trained to do. GPT-5, Claude, and Gemini 3 are all optimized to be helpful and correct. When they see a login form, their instinct is to fill it out properly. They don't naturally think to submit with an empty password field, or to click "forgot password" mid-login, or to paste an email with trailing whitespace.

# Adversarial action sampling
import random

def sample_action(state, changed_components):
    if random.random() < 0.3:  # 30% adversarial actions
        return sample_adversarial_action(state)
    else:
        return policy.sample(state)

def sample_adversarial_action(state):
    actions = [
        partial_form_submit,    # Submit with missing fields
        rapid_double_click,     # Click same element twice
        mid_action_navigate,    # Leave during async operation
        invalid_input_type,     # Wrong data type in field
        keyboard_interrupt,     # Escape key during modal
    ]
    return random.choice(actions)

We explicitly train our agent to explore these failure modes. The reward function doesn't just credit engagement—it credits novel state discovery. Finding an error message the agent hasn't seen before is worth more than another successful form submission.
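
One simple way to implement that credit is to hash a coarse snapshot of the page state and pay a bonus only the first time a state is seen; a minimal sketch, where the snapshot serialization and the bonus value are assumptions rather than our actual setup.

import hashlib

NOVELTY_BONUS = 1.0        # illustrative value, not a tuned constant
seen_states: set[str] = set()

def novelty_bonus(state_snapshot: str) -> float:
    """Extra reward the first time a (coarsely serialized) UI state is observed."""
    key = hashlib.sha256(state_snapshot.encode()).hexdigest()
    if key in seen_states:
        return 0.0
    seen_states.add(key)
    return NOVELTY_BONUS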

This is why we couldn't just prompt an existing model. We needed to train our own policy that treats "incorrect" actions as first-class exploration strategies, not mistakes to avoid.

Training Loop

We train on a mix of synthetic diffs, real PRs from open-source projects, and a private dataset we've built in collaboration with our customers. Each episode applies a diff and the agent explores until all changed components have been seen and interacted with.

Episode structure:
1. Sample diff from corpus
2. Apply diff to base application
3. Agent explores until termination
4. Compute cumulative reward
5. Update policy via PPO
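
In code, that episode structure looks roughly like this; a sketch in which env, policy, and trainer are assumed objects exposing the obvious methods, not our actual interfaces.

import random

def run_episode(env, policy, diff, max_steps=200):
    """One episode: apply a diff, let the agent explore, collect the trajectory."""
    state = env.reset()
    env.apply_diff(diff)                                # stage the change under test
    trajectory = []
    for _ in range(max_steps):
        action = policy.sample(state)
        next_state, reward, done = env.step(action)     # reward as defined above
        trajectory.append((state, action, reward))
        state = next_state
        if done:                                        # all changed components seen and engaged
            break
    return trajectory

def train(env, policy, trainer, diff_corpus, episodes=10_000):
    for _ in range(episodes):
        diff = random.choice(diff_corpus)               # 1. sample diff from corpus
        trajectory = run_episode(env, policy, diff)     # 2-4. apply, explore, score
        trainer.ppo_update(policy, trajectory)          # 5. policy update via PPO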

After 10K episodes, the agent achieves 94% coverage on held-out diffs.

What This Enables

The trained agent can take any PR and systematically verify that the changed code actually works in the browser. It's not just checking that code runs—it's checking that users can interact with it.

In practice, this catches a class of bugs that traditional testing misses:

  • Z-index issues — Components render but are covered by other elements
  • Event handler bugs — Buttons appear but onClick doesn't fire
  • Scroll traps — Components exist but can't be scrolled into view
  • Responsive breakage — Works on desktop, broken on mobile viewports

The key insight is that code coverage isn't enough. You need viewport coverage—proof that changes are visible and interactive in the actual browser environment.

Try the demo →

Hover over files in the component tree to see their bounding boxes. Scroll through this article to watch the agent's coverage progress through each stage of verification.

Try PR Testing on Your Repo

Install our GitHub App to get automated browser verification on every pull request.

Install GitHub App