
Grounding Vision-Language Agents in the DOM: The Element Reference Pattern

By Aditya Chaturvedi

The browser is the universal application runtime, yet programmatic interaction with web interfaces has remained remarkably brittle. Selenium scripts break when a class name changes. Scrapers fail when a site redesigns.

Vision-language models (VLMs) like Claude, GPT-4V, and Gemini offer a different approach: give a model a screenshot and let it see what a human sees. This has sparked a wave of autonomous browser agents that combine VLMs with Playwright to navigate websites and complete tasks without hand-written scripts.

We've been building one of these. Ours is an exploration agent that, given a URL, discovers every meaningful action on the page, executes them, and recursively maps the graph of reachable states. The output is a complete graph of user journeys: homepage to product listing to product detail to cart to checkout. This is harder than single-task agents because there's no predefined goal. The agent must discover what's possible, and every page presents fresh decisions: which elements are interactive, which are worth exploring, and how to click each one.

That last part is where things break down.

What You Actually Send the Model

A browser agent has to solve two problems for every action it takes:

1. Semantic understanding: What are the meaningful actions on this page? Which button is the primary CTA? Which links lead to new content vs. which are decorative?

2. Mechanical execution: Given a specific element the agent wants to interact with, produce a way to target it in the DOM and click it.

VLMs are extraordinarily good at problem #1. The question is how you give them enough information about the page to do it and whether the same representation can also solve problem #2. In practice, there are three things you can send the model. Each seems promising. Each breaks in its own way.

Full page HTML

The obvious first attempt. Send the DOM. The problems are immediate:

Size and noise. A typical e-commerce page is 200-500KB of markup, still 50K+ tokens after stripping scripts and styles. Most of it is nested <div>s, SVG icons, and tracking pixels. The model must find 5-10 meaningful actions in 50,000 tokens of angle brackets.

Hidden elements. Raw HTML contains every element, including mobile drawers, collapsed menus, and off-screen widgets. A typical page has 300+ interactive elements in the DOM; only 80-100 are visible. The model will confidently suggest clicking a hamburger menu that's invisible on desktop.

Selectors are still free-form. Even with real attributes in front of it, the model generates selectors as text: it sees class="sc-1x2f3y4 kJmWbR" and writes button.add-to-cart. CSS-in-JS hashes bear no semantic relationship to purpose. And a[href="/products"] is valid CSS but matches 34 elements on the page, an ambiguity the model can't detect.

Accessibility tree

Playwright's aria_snapshot() gives ~4K tokens of semantic structure instead of 50K. But it loses critical information: a hero CTA and a footer link are structurally identical - both appear as link "Shop All". Many elements lack labels entirely, and 30-40% of interactive elements on real sites never make it into the tree. And the tree contains no CSS selectors and no spatial information, so you can't target elements or reason about layout.

Screenshot

VLMs reason well about visual layout. They grasp that the big orange button is a CTA and the search bar is in the header. But a screenshot is pixels. You can't address a DOM element from an image alone.

The core tension

Each representation captures something the others miss. And across all three, the model is solving two fundamentally different tasks in one output: decide what to interact with (reasoning) and produce a selector to target it (mechanical precision). One is what LLMs are good at. The other is what they're bad at.

The Element Reference Pattern

The insight: stop asking the model to generate selectors. Instead, pre-extract visible elements, assign each a number, and let the model point by reference.

Before the LLM call, a JavaScript extraction script runs in the browser: find every interactive element, filter out hidden ones, generate a deterministic CSS selector for each, and return a numbered list. Send this alongside a screenshot (for visual reasoning) and the accessibility tree (for semantic structure). The model returns element numbers, not selectors:

1. [link] <a> "Shop All"
   selector: a[href="/collections/all"]
   position: (120, 45) size: 80x24

2. [button] <button> "Add to Cart"
   selector: #add-to-cart-btn
   position: (450, 620) size: 200x48

The model returns element_ref: 2 to click "Add to Cart." It doesn't need to generate the selector because the system already has one. Reasoning and targeting are completely separated.
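As a sketch of what this looks like in code, the numbered list could be rendered from the extracted elements like so. ElementInfo and format_element_list are illustrative names, not the project's actual code:

```python
from dataclasses import dataclass

@dataclass
class ElementInfo:
    role: str        # e.g. "link", "button"
    tag: str         # e.g. "a", "button"
    text: str
    selector: str    # pre-generated by the extraction script
    x: int
    y: int
    width: int
    height: int

def format_element_list(elements: list[ElementInfo]) -> str:
    """Render the numbered element list sent to the model alongside
    the screenshot and the accessibility tree."""
    lines = []
    for i, el in enumerate(elements, start=1):
        lines.append(f'{i}. [{el.role}] <{el.tag}> "{el.text}"')
        lines.append(f"   selector: {el.selector}")
        lines.append(f"   position: ({el.x}, {el.y}) size: {el.width}x{el.height}")
    return "\n".join(lines)
```

The numbering is the contract: the model only ever needs to echo an index back, and the system owns the mapping from index to selector.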

The screenshot uses a scroll-and-stitch capture rather than Playwright's full_page=True, which breaks viewport units and duplicates sticky headers; scrolling through the page also triggers lazy-loaded content. The accessibility tree is truncated to ~4K tokens. The element list typically contains 80-100 entries after visibility filtering removes the 40-60% of DOM elements that aren't actually clickable.
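The stitch capture boils down to computing a sequence of scroll offsets, one viewport at a time, with the last frame clamped to the page bottom. A minimal sketch of that computation (scroll_positions is an illustrative helper, not the project's code):

```python
def scroll_positions(page_height: int, viewport_height: int) -> list[int]:
    """Y-offsets for a scroll-and-stitch capture. Steps one viewport
    at a time; the final offset is clamped so the last frame ends
    exactly at the page bottom (the overlap is cropped when stitching)."""
    if page_height <= viewport_height:
        return [0]
    positions = list(range(0, page_height - viewport_height, viewport_height))
    positions.append(page_height - viewport_height)
    return positions
```

The agent would scroll to each offset, wait for lazy content to settle, screenshot the viewport, then crop and stack the frames.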

The Selector Waterfall

The element list's value depends entirely on selector quality. Our extraction uses a 12-strategy waterfall where each strategy generates a candidate and verifies it matches exactly one element via querySelectorAll. The first unique match wins:

# Brief idea

Strategy 1-3:   #id, [data-testid], [data-test]         (most resilient)
Strategy 4-7:   a[href], [name], [aria-label], [data-*]  (semantic attributes)
Strategy 8-9:   #ancestor descendant, tag.class           (scoped combinations)
Strategy 10-12: nth-of-type, class paths, structural      (positional fallback)

The ordering is a deliberate tradeoff: early strategies produce selectors that survive deploys (IDs, test attributes); later strategies are fragile but guaranteed to produce a unique match for any element. For shadow DOM, the script recursively traverses open shadow roots and composes selectors as hostSelector innerSelector. Visibility checks cross shadow boundaries to catch hidden drawers and overlays.

Element Resolution

The model returns actions with an element_ref (usually correct) and a selector (often hallucinated). The resolution layer ignores the model's selector and looks up the real one.

if action.element_ref is not None:
    idx = action.element_ref - 1
    if 0 <= idx < len(elements):
        action = action.model_copy(update={"selector": elements[idx].selector})

About 60% of actions need selector correction because the model writes button.add-to-cart when the real selector is #product-widget button[data-action="addToCart"]. This is fine. The element reference is all that matters.

Selectors also pass ambiguity detection (rejecting 50+ matches) and visibility verification. When a selector matches 2-5 elements, the system refines it by re-running the waterfall on the resolved element, producing a unique selector for deterministic replay.
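Putting the lookup and the ambiguity check together, the resolution layer might look like the following sketch. Action, Extracted, resolve, and MAX_MATCHES are illustrative names, and dataclasses.replace stands in for pydantic's model_copy; count_matches again stands in for a live querySelectorAll length check:

```python
from dataclasses import dataclass, replace
from typing import Callable

@dataclass(frozen=True)
class Extracted:
    selector: str          # produced by the waterfall at extraction time

@dataclass(frozen=True)
class Action:
    element_ref: int | None    # usually correct
    selector: str | None       # often hallucinated; ignored when ref resolves

MAX_MATCHES = 50  # reject wildly ambiguous selectors outright

def resolve(action: Action, elements: list[Extracted],
            count_matches: Callable[[str], int]) -> Action | None:
    """Swap the model's selector for the extracted one, then sanity-check."""
    if action.element_ref is not None:
        idx = action.element_ref - 1
        if 0 <= idx < len(elements):
            action = replace(action, selector=elements[idx].selector)
    if action.selector is None:
        return None
    n = count_matches(action.selector)
    if n == 0 or n >= MAX_MATCHES:
        return None        # nothing to click, or hopelessly ambiguous
    return action          # a 2-5 match count would trigger waterfall refinement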

Key Takeaways

LLMs and deterministic code have complementary strengths that shouldn't be mixed. Asking a model to generate a CSS selector is asking it to do something a 50-line JavaScript function does better. Asking JavaScript to decide which actions matter is asking it to do something a vision model does better.

The model outputs an integer. The system turns it into a click. When one side fails, the failure is isolated and fixable without touching the other.

This generalizes beyond browsers. Any system using LLMs to drive actions in a structured environment faces the same tension between flexible reasoning and mechanical precision. Let the model reason about what. Let deterministic code handle how.