# Grounding Vision-Language Agents in the DOM: The Element Reference Pattern

The browser is the universal application runtime, yet programmatic interaction with web interfaces has remained remarkably brittle. Selenium scripts break when a class name changes. Scrapers fail when a site redesigns.

Vision-language models (VLMs) like Claude, GPT-4V, and Gemini offer a different approach: give a model a screenshot and let it *see* what a human sees. This has sparked a wave of autonomous browser agents that combine VLMs with Playwright to navigate websites and complete tasks without hand-written scripts.

We've been building one of these. Ours is an exploration agent that, given a URL, discovers every meaningful action on the page, executes them, and recursively maps the tree of reachable states. The output is a complete graph of user journeys: homepage to product listing to product detail to cart to checkout. This is harder than single-task agents because there's no predefined goal. The agent must *discover* what's possible, and every page presents fresh decisions: which elements are interactive, which are worth exploring, and how to click each one.

That last part is where things break down.

## What You Actually Send the Model

A browser agent has to solve two problems for every action it takes:

1. **Semantic understanding**: What are the meaningful actions on this page? Which button is the primary CTA? Which links lead to new content vs. which are decorative?

2. **Mechanical execution**: Given a specific element the agent wants to interact with, produce a way to target it in the DOM and click it.

VLMs are extraordinarily good at problem #1. The question is how you give them enough information about the page to do it and whether the same representation can also solve problem #2. In practice, there are three things you can send the model. Each seems promising. Each breaks in its own way.

### Full page HTML

The obvious first attempt. Send the DOM. The problems are immediate:

**Size and noise.** A typical e-commerce page is 200-500KB of markup, still 50K+ tokens after stripping scripts and styles. Most of it is nested `<div>`s, SVG icons, and tracking pixels. The model must find 5-10 meaningful actions in 50,000 tokens of angle brackets.

**Hidden elements.** Raw HTML contains *every* element, including mobile drawers, collapsed menus, and off-screen widgets. A typical page has 300+ interactive elements in the DOM; only 80-100 are visible. The model will confidently suggest clicking a hamburger menu that's invisible on desktop.

**Selectors are still free-form.** Even reading real attributes, the model generates selectors as text. It sees `class="sc-1x2f3y4 kJmWbR"` and writes `button.add-to-cart`. CSS-in-JS hashes bear no semantic relationship to purpose. And `a[href="/products"]` is valid CSS but matches 34 elements, an ambiguity the model can't detect.

### Accessibility tree

Playwright's `aria_snapshot()` gives ~4K tokens of semantic structure instead of 50K. But it loses critical information: a hero CTA and a footer link are structurally identical (`- link "Shop All"`). Many elements lack labels entirely, and 30-40% of interactive elements on real sites are invisible to the tree. The tree also contains no CSS selectors and no spatial information, so you can't target elements or reason about layout.

### Screenshot

VLMs reason well about visual layout. They grasp that the big orange button is a CTA and the search bar is in the header. But a screenshot is pixels. You can't address a DOM element from an image alone.

### The core tension

Each representation captures something the others miss. And across all three, the model is solving two fundamentally different tasks in one output: *decide what to interact with* (reasoning) and *produce a selector to target it* (mechanical precision). One is what LLMs are good at. The other is what they're bad at.

## The Element Reference Pattern

The insight: stop asking the model to generate selectors. Instead, **pre-extract visible elements, assign each a number, and let the model point by reference.**

Before the LLM call, a JavaScript extraction script runs in the browser: find every interactive element, filter out hidden ones, generate a deterministic CSS selector for each, and return a numbered list. Send this alongside a screenshot (for visual reasoning) and the accessibility tree (for semantic structure). The model returns element numbers, not selectors. The list it sees looks like this:

```plaintext
1. [link] <a> "Shop All"
   selector: a[href="/collections/all"]
   position: (120, 45) size: 80x24

2. [button] <button> "Add to Cart"
   selector: #add-to-cart-btn
   position: (450, 620) size: 200x48
```

The model returns `element_ref: 2` to click "Add to Cart." It doesn't need to generate the selector because the system already has one. Reasoning and targeting are completely separated.
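As a rough illustration, the numbered list could be rendered from the extraction results along these lines. The `ExtractedElement` shape and the `format_element_list` helper are hypothetical names chosen for this sketch, not the agent's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedElement:
    """One entry returned by the in-browser extraction step (assumed shape)."""
    role: str       # e.g. "link", "button"
    tag: str        # e.g. "a", "button"
    text: str       # visible label
    selector: str   # deterministic selector produced by the extraction script
    x: int
    y: int
    width: int
    height: int


def format_element_list(elements: list[ExtractedElement]) -> str:
    """Render the numbered list sent to the model. Numbering is 1-based."""
    blocks = []
    for number, el in enumerate(elements, start=1):
        blocks.append(
            f'{number}. [{el.role}] <{el.tag}> "{el.text}"\n'
            f"   selector: {el.selector}\n"
            f"   position: ({el.x}, {el.y}) size: {el.width}x{el.height}"
        )
    return "\n\n".join(blocks)


elements = [
    ExtractedElement("link", "a", "Shop All", 'a[href="/collections/all"]', 120, 45, 80, 24),
    ExtractedElement("button", "button", "Add to Cart", "#add-to-cart-btn", 450, 620, 200, 48),
]
prompt_section = format_element_list(elements)
print(prompt_section)
```

Because the number shown to the model is just the list position plus one, the same `elements` list can be kept in memory and indexed again when the model's answer comes back.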

The screenshot comes from a scroll-and-stitch capture rather than Playwright's `full_page=True`, which breaks viewport units and duplicates sticky headers. Scrolling through the page also triggers lazy-loaded content. The accessibility tree is truncated to ~4K tokens. The element list typically contains 80-100 entries after visibility filtering removes the 40-60% of DOM elements that aren't actually clickable.
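The visibility filter itself runs in the browser, but its logic can be sketched in Python over records of computed style and bounding box. The specific checks below (display, visibility, opacity, zero-size boxes, boxes entirely outside the page) are assumptions about what such a filter would plausibly test, and `RawElement` and `is_visible` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawElement:
    """Computed style and bounding box for one candidate element (assumed shape)."""
    label: str
    display: str      # computed CSS `display`
    visibility: str   # computed CSS `visibility`
    opacity: float
    x: float
    y: float
    width: float
    height: float


def is_visible(el: RawElement, page_width: float, page_height: float) -> bool:
    """Return True only if the element could plausibly be seen and clicked."""
    if el.display == "none" or el.visibility != "visible":
        return False
    if el.opacity == 0:
        return False
    if el.width <= 0 or el.height <= 0:
        return False
    # Reject boxes that lie entirely outside the page, such as off-screen drawers.
    if el.x + el.width <= 0 or el.y + el.height <= 0:
        return False
    if el.x >= page_width or el.y >= page_height:
        return False
    return True


raw = [
    RawElement("Add to Cart", "block", "visible", 1.0, 450, 620, 200, 48),
    RawElement("Mobile menu", "none", "visible", 1.0, 0, 0, 40, 40),
    RawElement("Cart drawer", "block", "visible", 1.0, -320, 0, 300, 900),
    RawElement("Tracking link", "inline", "visible", 1.0, 10, 10, 0, 0),
]
visible = [el for el in raw if is_visible(el, page_width=1280, page_height=4000)]
print([el.label for el in visible])
```

A real filter would likely also need to handle elements hidden by an ancestor or covered by an overlay, which this sketch does not attempt.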

## The Selector Waterfall

The element list's value depends entirely on selector quality. Our extraction uses a 12-strategy waterfall where each strategy generates a candidate and verifies it matches exactly one element via `querySelectorAll`. The first unique match wins:

```plaintext
# Brief idea

Strategy 1-3:   #id, [data-testid], [data-test]         (most resilient)
Strategy 4-7:   a[href], [name], [aria-label], [data-*]  (semantic attributes)
Strategy 8-9:   #ancestor descendant, tag.class           (scoped combinations)
Strategy 10-12: nth-of-type, class paths, structural      (positional fallback)
```

The ordering is a deliberate tradeoff: early strategies produce selectors that survive deploys (IDs, test attributes); later strategies are fragile but guaranteed to produce a unique match for any element. For shadow DOM, the script recursively traverses open shadow roots and composes selectors as `hostSelector innerSelector`. Visibility checks cross shadow boundaries to catch hidden drawers and overlays.
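The control flow of the waterfall is simple enough to sketch in Python. This is a simplified illustration with four of the twelve strategies, not the extraction script itself: the strategy functions, `first_unique_selector`, and the `count_matches` callback are hypothetical names. In the browser, `count_matches` would wrap `querySelectorAll(selector).length`; here a lookup table stands in for the page.

```python
from typing import Callable, Optional

Attrs = dict[str, str]
Strategy = Callable[[Attrs], Optional[str]]


def by_id(attrs: Attrs) -> Optional[str]:
    value = attrs.get("id")
    return f"#{value}" if value else None


def by_test_id(attrs: Attrs) -> Optional[str]:
    value = attrs.get("data-testid")
    return f'[data-testid="{value}"]' if value else None


def by_href(attrs: Attrs) -> Optional[str]:
    value = attrs.get("href")
    return f'a[href="{value}"]' if attrs.get("tag") == "a" and value else None


def by_aria_label(attrs: Attrs) -> Optional[str]:
    value = attrs.get("aria-label")
    return f'[aria-label="{value}"]' if value else None


# Ordered from most resilient to least, mirroring the tiers above.
# Real code would also escape attribute values before building selectors.
STRATEGIES: list[Strategy] = [by_id, by_test_id, by_href, by_aria_label]


def first_unique_selector(
    attrs: Attrs,
    count_matches: Callable[[str], int],
    strategies: list[Strategy] = STRATEGIES,
) -> Optional[str]:
    """Return the first candidate that matches exactly one element."""
    for strategy in strategies:
        candidate = strategy(attrs)
        if candidate is not None and count_matches(candidate) == 1:
            return candidate
    # The positional fallback strategies are omitted in this sketch.
    return None


# Stand-in for the page: how many elements each selector would match.
fake_match_counts = {'a[href="/products"]': 34, '[aria-label="Browse products"]': 1}
link = {"tag": "a", "href": "/products", "aria-label": "Browse products"}
selector = first_unique_selector(link, lambda s: fake_match_counts.get(s, 0))
print(selector)
```

In the example, the `a[href="/products"]` candidate is generated first but discarded because it matches 34 elements, so the waterfall moves on to the `aria-label` candidate. The uniqueness check is what turns a plausible selector into a usable one.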

## Element Resolution

The model returns actions with an `element_ref` (usually correct) and a `selector` (often hallucinated). The resolution layer ignores the model's selector and looks up the real one.

```python
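# Resolve the model's element_ref to the pre-extracted selector,
# overriding whatever selector string the model wrote.
if action.element_ref is not None:
    idx = action.element_ref - 1  # the list shown to the model is 1-indexed
    if 0 <= idx < len(elements):
        action = action.model_copy(update={"selector": elements[idx].selector})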
```

About 60% of actions need selector correction because the model writes `button.add-to-cart` when the real selector is `#product-widget button[data-action="addToCart"]`. This is fine. The element reference is all that matters.

Selectors also pass through ambiguity detection (anything matching 50+ elements is rejected) and visibility verification. When a selector matches 2-5 elements, the system refines it by re-running the waterfall on the resolved element, producing a unique selector for deterministic replay.
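The thresholds above can be expressed as a small classifier over the match count. This is a sketch under stated assumptions: the `Verdict` names and `classify_match_count` are hypothetical, and the `MISSING` and `AMBIGUOUS` outcomes are additions for completeness. In particular, the text above does not say what happens to selectors matching 6-49 elements, so the sketch only flags them rather than guessing.

```python
from enum import Enum


class Verdict(Enum):
    MISSING = "missing"      # selector matches nothing (not covered in the text)
    UNIQUE = "unique"        # exactly one match, safe to replay
    REFINE = "refine"        # 2-5 matches: re-run the waterfall on the resolved element
    AMBIGUOUS = "ambiguous"  # 6-49 matches: handling unspecified in the text
    REJECT = "reject"        # 50+ matches: discard the selector


REFINE_MAX = 5
REJECT_MIN = 50


def classify_match_count(count: int) -> Verdict:
    """Map a querySelectorAll match count to the action the resolver should take."""
    if count <= 0:
        return Verdict.MISSING
    if count == 1:
        return Verdict.UNIQUE
    if count <= REFINE_MAX:
        return Verdict.REFINE
    if count >= REJECT_MIN:
        return Verdict.REJECT
    return Verdict.AMBIGUOUS


for count in (1, 3, 34, 50):
    print(count, classify_match_count(count).value)
```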

## Key Takeaways

**LLMs and deterministic code have complementary strengths that shouldn't be mixed.** Asking a model to generate a CSS selector is asking it to do something a 50-line JavaScript function does better. Asking JavaScript to decide which actions matter is asking it to do something a vision model does better.

The model outputs an integer. The system turns it into a click. When one side fails, the failure is isolated and fixable without touching the other.

This generalizes beyond browsers. Any system using LLMs to drive actions in a structured environment faces the same tension between flexible reasoning and mechanical precision. Let the model reason about *what*. Let deterministic code handle *how*.
