source: Refik Anadol

But How Does OpenAI Operator Even Work?

George Salapa
4 min read · Jan 25, 2025

No black magic, really, just a DOM parser with rectangle mapping.

A few months ago I attempted to create a similar web-agent. (link: https://x.com/jurajsalapa/status/1853566697522798836)

So how does the code work?

First, a headless browser instance is launched (e.g. using Playwright).
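For the curious, here is a minimal sketch of that launch step using Playwright's Python sync API. The helper name `browser_page` and the viewport size are my own assumptions for illustration, not anything Operator is known to use.

```python
from contextlib import contextmanager


@contextmanager
def browser_page(url: str, headless: bool = True):
    """Open `url` in a Chromium page and close the browser afterwards."""
    # Imported here so the module can be loaded even where Playwright
    # (a third-party package) is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        # The viewport size is an arbitrary choice for this sketch.
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url)
        try:
            yield page
        finally:
            browser.close()
```

You would use it as `with browser_page("https://example.com") as page: ...`, and everything the agent does happens inside that block.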

We then implement a page observation system that processes the current website’s DOM. Modern web apps are complex — buried under layers of JavaScript, dynamic content, shadow DOMs, and cross-origin iframes. Our system filters this complexity down to actionable elements — those our agent can meaningfully interact with through clicks or text input.
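A rough sketch of that filtering, assuming the DOM has already been flattened into plain dictionaries (for instance by a script evaluated in the page). The field names (`tag`, `role`, `visible`, `disabled`, `contenteditable`) and the tag and role sets are illustrative assumptions on my part.

```python
# Hypothetical element records: one dict per DOM node, already flattened.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox",
                     "combobox", "menuitem", "tab"}


def is_actionable(el: dict) -> bool:
    """Return True if the agent could plausibly click or type into `el`."""
    if not el.get("visible", False) or el.get("disabled", False):
        return False
    tag = el.get("tag", "").lower()
    if tag == "input" and el.get("type") == "hidden":
        return False
    if tag in INTERACTIVE_TAGS or el.get("role") in INTERACTIVE_ROLES:
        return True
    # contenteditable regions accept text input even on generic tags
    return bool(el.get("contenteditable"))


def filter_actionable(elements: list[dict]) -> list[dict]:
    """Keep only the elements worth showing to the agent."""
    return [el for el in elements if is_actionable(el)]
```

A real page needs more rules than this (elements with click handlers, elements hidden behind overlays, and so on), but the idea is the same: throw away everything the agent cannot act on.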

The system then calculates element bounds using the Chrome DevTools Protocol (CDP). This gives us precise coordinates and dimensions for each interactive element, which are essential for visual identification. (Don’t worry about that just yet.)
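One way to get those bounds is CDP's `DOM.getBoxModel`, which returns each box as a quad of eight numbers. The sketch below assumes `cdp` is a Playwright CDP session (for example `page.context.new_cdp_session(page)` on a Chromium page) and that you already hold a backend node id for the element; the function names are mine.

```python
from typing import Any


def quad_to_rect(quad: list[float]) -> dict[str, float]:
    """Collapse a CDP quad (x1, y1, ..., x4, y4) into an axis-aligned box."""
    xs, ys = quad[0::2], quad[1::2]
    return {
        "x": min(xs),
        "y": min(ys),
        "width": max(xs) - min(xs),
        "height": max(ys) - min(ys),
    }


def element_bounds(cdp: Any, backend_node_id: int) -> dict[str, float]:
    """Ask CDP for an element's box model and return its border box."""
    result = cdp.send("DOM.getBoxModel", {"backendNodeId": backend_node_id})
    return quad_to_rect(result["model"]["border"])
```

I use the border box here because it matches what a user perceives as the clickable area; the content or padding quad would work too if you prefer a tighter rectangle.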

Next, we create element mappings. This step builds a bridge between visual elements and their DOM selectors, prioritizing accessibility attributes (aria-labels, roles), then falling back to text content and structural selectors.
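The priority order above might look something like this. The element shape (`attrs`, `text`, `css_path`) is a hypothetical one, and `css_path` stands for a structural selector computed elsewhere; the selector strings themselves use Playwright's CSS, `role=` and `text=` selector syntax.

```python
def _quote(value: str) -> str:
    """Escape backslashes and double quotes for use inside a quoted selector."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_selector(el: dict) -> str:
    """Pick the most stable selector available for one element."""
    attrs = el.get("attrs", {})
    text = (el.get("text") or "").strip()
    if attrs.get("aria-label"):
        return f'[aria-label="{_quote(attrs["aria-label"])}"]'
    if attrs.get("role") and text:
        return f'role={attrs["role"]}[name="{_quote(text)}"]'
    if text:
        return f'text="{_quote(text)}"'
    # Last resort: a structural path, which breaks when the layout changes.
    return el["css_path"]


def build_mapping(elements: list[dict]) -> dict[int, str]:
    """Map the number drawn on each box (1, 2, 3, ...) to a DOM selector."""
    return {i: build_selector(el) for i, el in enumerate(elements, start=1)}
```

The numbers are the bridge: the model only ever talks about "element 3", and this mapping turns that back into something Playwright can act on.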

We take a screenshot of the page we are currently on and draw boxes around the elements the agent can interact with (this is where we use the element bounds from earlier).

This is the step that lets our agent see what we see: it produces an annotated image, which is sent to a vision model.
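Drawing the boxes is the easy part. Here is a small sketch using Pillow (a third-party library, and my choice rather than anything the original agent is known to use). It assumes `boxes` maps each element number to a rectangle with `x`, `y`, `width` and `height`; the 12-pixel label height is a rough guess for Pillow's default font.

```python
import io


def label_position(box: dict, margin: int = 2) -> tuple[int, int]:
    """Put the number just above the box, or inside it if there is no room."""
    x, y = int(box["x"]), int(box["y"])
    if y >= 12 + margin:
        return (x, y - 12 - margin)
    return (x + margin, y + margin)


def annotate(png_bytes: bytes, boxes: dict[int, dict]) -> bytes:
    """Draw a numbered rectangle per element and return the new PNG bytes."""
    # Imported here so label_position stays usable without Pillow installed.
    from PIL import Image, ImageDraw

    img = Image.open(io.BytesIO(png_bytes)).convert("RGB")
    draw = ImageDraw.Draw(img)
    for index, box in boxes.items():
        x0, y0 = box["x"], box["y"]
        x1, y1 = x0 + box["width"], y0 + box["height"]
        draw.rectangle([x0, y0, x1, y1], outline=(255, 0, 0), width=2)
        draw.text(label_position(box), str(index), fill=(255, 0, 0))
    out = io.BytesIO()
    img.save(out, format="PNG")
    return out.getvalue()
```

With Playwright, `page.screenshot()` returns the PNG bytes you would pass in as `png_bytes`.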

Now imagine you have an agentic loop (see also my article on agent building), where the model loops in conversation with you and has several tools at its disposal. At the top of each iteration you run the script above to get the filtered DOM elements, add Playwright instructions (which the model already knows from training, so you are just reminding it), and send all of this, together with the annotated screenshot, to a multimodal model. In one step it describes the image and works out what it needs to do to achieve your goal, taking into account the steps it has already taken.
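Stripped of all detail, the loop can be sketched like this. The three callables (`observe`, `ask_model`, `act`) and the `{"action": ...}` decision format are placeholders I made up; in a real agent `ask_model` would wrap a call to a multimodal model.

```python
from typing import Callable


def run_agent(goal: str,
              observe: Callable[[], dict],
              ask_model: Callable[[str, dict, list], dict],
              act: Callable[[dict], str],
              max_steps: int = 10) -> list[dict]:
    """Observe, ask the model, act; repeat until the model says it is done."""
    history: list[dict] = []
    for _ in range(max_steps):
        # Filtered elements plus the annotated screenshot for this step.
        observation = observe()
        # The model sees the goal, the current page and its own past steps.
        decision = ask_model(goal, observation, history)
        if decision.get("action") == "done":
            history.append({"decision": decision, "result": "finished"})
            break
        result = act(decision)
        history.append({"decision": decision, "result": result})
    return history
```

The `max_steps` cap is there so a confused model cannot loop forever; passing `history` back in on every step is what lets the model build on what it has already done.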

The model then needs a tool that lets it act on that input. So we give it a browser tool that lets it send commands directly to Playwright.
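A sketch of such a tool follows. Note that it is more conservative than letting the model send raw Playwright code: it accepts a small, made-up command format (`action`, `target`, `text`, `url`, `key`) and translates it into `Page` calls. `page` is assumed to be a Playwright sync `Page`, and `selectors` a mapping from element numbers to selectors.

```python
def run_browser_command(page, command: dict, selectors: dict[int, str]) -> str:
    """Translate one model-issued command into a Playwright call."""
    action = command.get("action")
    try:
        if action == "goto":
            page.goto(command["url"])
        elif action == "click":
            page.click(selectors[command["target"]])
        elif action == "type":
            page.fill(selectors[command["target"]], command["text"])
        elif action == "press":
            page.keyboard.press(command["key"])
        else:
            return f"unknown action: {action}"
    except Exception as exc:
        # Report failures back to the model instead of crashing the loop,
        # so it can try a different element or approach on the next step.
        return f"error: {exc}"
    return "ok"
```

Returning a short status string instead of raising matters more than it looks: the result goes back into the conversation, and the model can often recover from an error on its own.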

And that is it.

The Rise of AI Operators & Evolution of Web Architecture

The modern web is built for humans: a messy, resilient interface that assumes a human ability to parse visual context, understand inconsistent patterns, and adapt to changing layouts. This works great for us but creates significant challenges for AI operators attempting to navigate and interact with websites programmatically. The web’s complexity (layers of JavaScript frameworks, dynamic content loading, anti-bot measures, and inconsistent interaction patterns) makes many sites essentially invisible or impenetrable to automation scripts.

QWERTY Operator (or our web-agent above) is an automation system that bridges this gap by doing what humans do: it “looks” at the page, understands visual context, and interacts with elements it can “see”. But this is just a workaround. You will see many tweets about how Operator stumbled on a black, ‘invisible’ website.

Just like the mobile revolution forced sites to become “mobile-first”, the AI revolution may be pushing us towards an “operator-first” design. This means clean HTML semantics, consistent interaction patterns, clear accessibility markers, and standardized ways to declare element purposes and relationships. Sites that ignore this risk dying out.

The implications go beyond technical implementation. When operators become primary users alongside humans, we need to rethink core web architecture. Rate limiting needs to become negotiation-based rather than binary blocks. CAPTCHAs need alternatives that verify legitimate automation. Authentication systems need to handle both human and operator contexts. Most importantly, websites need to start exposing their functionality in ways that are consistently parseable by machines while maintaining human usability. This isn’t just about adding APIs — it’s about making the web interface itself machine-comprehensible by design.

The sites that adapt first will have significant advantages in the emerging operator-driven web ecosystem.

Written by George Salapa

Thoughts on technology, coding, money & culture. Wrote for Forbes and Venturebeat before.
