`s, SVG icons, and tracking pixels. The model must find 5-10 meaningful actions in 50,000 tokens of angle brackets.

**Hidden elements.** Raw HTML contains *every* element, including mobile drawers, collapsed menus, and off-screen widgets. A typical page has 300+ interactive elements in the DOM; only 80-100 are visible. The model will confidently suggest clicking a hamburger menu that's invisible on desktop.

**Selectors are still free-form.** Even reading real attributes, the model generates selectors as text. It sees `class="sc-1x2f3y4 kJmWbR"` and writes `button.add-to-cart`. CSS-in-JS hashes bear no semantic relationship to purpose. And `a[href="/products"]` is valid CSS but matches 34 elements, an ambiguity the model can't detect.

### Accessibility tree

Playwright's `aria_snapshot()` gives ~4K tokens of semantic structure instead of 50K. But it loses critical information: a hero CTA and footer link are structurally identical (`- link "Shop All"`). Many elements lack labels entirely, and 30-40% of interactive elements on real sites are invisible to the tree. And the tree contains no CSS selectors and no spatial information. You can't target elements or reason about layout.

### Screenshot

VLMs reason well about visual layout. They grasp that the big orange button is a CTA and the search bar is in the header. But a screenshot is pixels. You can't address a DOM element from an image alone.

### The core tension

Each representation captures something the others miss. And across all three, the model is solving two fundamentally different tasks in one output: *decide what to interact with* (reasoning) and *produce a selector to target it* (mechanical precision). One is what LLMs are good at. The other is what they're bad at.

## The Element Reference Pattern

The insight: stop asking the model to generate selectors.
Instead, **pre-extract visible elements, assign each a number, and let the model point by reference.**

Before the LLM call, a JavaScript extraction script runs in the browser: find every interactive element, filter out hidden ones, generate a deterministic CSS selector for each, and return a numbered list. Send this alongside a screenshot (for visual reasoning) and the accessibility tree (for semantic structure). The model returns element numbers, not selectors:

```plaintext
1. [link] "Shop All"
   selector: a[href="/collections/all"]
   position: (120, 45)  size: 80x24
2. [button]