Alibaba’s page-agent is a JavaScript library that embeds an AI agent directly into any web page, enabling natural language control of the DOM—no browser extensions, Python scripts, or headless Chrome instances required. With 8,600+ GitHub stars accumulated since its public release, it represents a meaningfully different approach to web automation: the missing middleware layer between LLMs and the existing web.
What Is Page-Agent?
Page-agent is an open-source TypeScript library published by Alibaba that turns any web interface into a surface an LLM can operate. The project, hosted at alibaba/page-agent on GitHub, is MIT-licensed and available via npm.
Unlike headless browser frameworks such as Playwright or Puppeteer—which automate a separate browser instance from outside—page-agent embeds directly into the running page via a <script> tag or npm import. The agent lives inside the user’s browser session. It sees the DOM the user sees. It acts with the permissions the user already has.
The practical consequence: a developer can ship an AI copilot into an existing web application with roughly a dozen lines of code and no backend rewrite.
```typescript
import { PageAgent } from 'page-agent';

const agent = new PageAgent({
  model: 'qwen3.5-plus',
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
});

await agent.execute('Find the highest-priority open ticket and assign it to Alice');
```

As of March 2026, the project is on v1.5.7 across 22 releases, with 675 forks and active development.1
How Does Page-Agent Work?
The architecture implements a tight Observe–Think–Act loop across three stages:
- Observe: The `PageController` component extracts the current DOM state, converting the page into a simplified HTML representation with indexed interactive elements stripped of visual noise.
- Think: This text representation, combined with task history, is passed to the configured LLM. The model reasons about what action to take next.
- Act: Selected tools execute synthetic DOM operations against the live page: clicks, form fills, scrolls, navigation.
Each step issues a fresh LLM call with updated page state, making the system reactive to dynamic changes like modals, loading spinners, and paginated tables.
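The loop can be sketched in a few lines of TypeScript. This is illustrative pseudo-architecture under stated assumptions, not page-agent's actual implementation; the names `observe`, `step`, and `think` are hypothetical.

```typescript
// A minimal sketch of the Observe-Think-Act loop described above.
// All names here are hypothetical, not part of the page-agent API.

type Action = { tool: 'click' | 'fill' | 'done'; index?: number; value?: string };

// Observe: reduce the page's interactive elements to an indexed text form.
function observe(elements: string[]): string {
  return elements.map((el, i) => `[${i}] ${el}`).join('\n');
}

// One iteration: serialize fresh page state, ask the model, record the action.
// Because state is re-observed every step, dynamic changes (modals, spinners)
// are visible to the next LLM call.
async function step(
  elements: string[],
  history: string[],
  think: (state: string, history: string[]) => Promise<Action>
): Promise<Action> {
  const state = observe(elements);
  const action = await think(state, history);
  history.push(JSON.stringify(action));
  return action;
}
```

The caller applies the returned action to the live DOM and re-enters the loop until the model emits a terminal action.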
The project is a TypeScript monorepo with seven packages: page-agent (user-facing), @page-agent/core, @page-agent/llms, @page-agent/page-controller, @page-agent/ui, @page-agent/extension (for multi-tab coordination), and @page-agent/website.2
Notable: the project’s documentation acknowledges that its DOM processing components derive from the browser-use project (Copyright 2024 Gregor Zunic, MIT Licensed).
Why Does the In-Page Architecture Matter?
Every other major web automation approach runs outside the browser:
- Playwright/Puppeteer: Automate a separate browser instance via WebDriver; require Node.js infrastructure and explicit credential management
- Browser-use: Python-based agent framework running in a headless Chrome process
- OpenAI’s Computer Use / Anthropic’s Computer Use: Screenshot-based visual agents operating from the OS level
Page-agent’s in-page position eliminates an entire category of operational friction. The agent inherits the authenticated session already open in the user’s browser. There is no separate credential store, no cookie synchronization problem, no TLS interception layer to maintain.
For developers shipping internal tooling—think ERP dashboards, customer support interfaces, data entry workflows—this matters enormously. Adding a natural-language command layer to an existing SaaS product becomes a pure frontend problem.
Provider-Agnostic LLM Support
Page-agent ships with no LLM lock-in. The LLMConfig interface accepts a custom baseURL and apiKey, meaning any OpenAI-compatible API endpoint works. The project ships model-specific patches for API variations across major providers.3
Supported as of March 2026:
| Provider | Integration Method |
|---|---|
| OpenAI (GPT-4o, o3) | Native via OpenAI-compatible API |
| Alibaba Qwen | Dashscope compatible endpoint |
| Anthropic Claude | Custom API patch |
| DeepSeek | OpenAI-compatible endpoint |
| Google Gemini | Compatible mode |
| Ollama (local) | Local endpoint, no API key |
The Ollama support is particularly significant: it makes offline deployment feasible for enterprises with data sovereignty requirements or air-gapped environments.
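Under that convention, a local deployment would point `baseURL` at Ollama's OpenAI-compatible endpoint. A sketch of the configuration values (the model tag is whatever has been pulled locally; Ollama serves a `/v1` API on port 11434 by default and ignores the API key):

```typescript
// Hypothetical LLMConfig values for a fully local, air-gapped setup.
const ollamaConfig = {
  model: 'qwen2.5:14b',                 // any locally pulled model tag
  baseURL: 'http://localhost:11434/v1', // Ollama's OpenAI-compatible endpoint
  apiKey: 'ollama',                     // placeholder; Ollama ignores the value
};
```

No traffic leaves the machine: the same OpenAI-compatible request shape is served entirely by the local Ollama process.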
Comparison: Page-Agent vs. Competing Approaches
| | Page-Agent | Playwright | Browser-Use | Stagehand |
|---|---|---|---|---|
| Deployment | In-page JS | External Node.js | External Python | External Node.js |
| Session auth | Inherited from browser | Manual credential mgmt | Manual credential mgmt | Manual credential mgmt |
| Interface method | DOM text extraction | WebDriver API | DOM + screenshot | DOM + screenshot |
| Vision required | No | No | Optional | Optional |
| Multi-tab | Extension required | Native | Native | Native |
| Best for | In-app copilots | CI/CD test automation | Autonomous research agents | Surgical AI actions |
| GitHub stars (Mar 2026) | ~8.6k | ~67k | ~21k | ~8k |
| License | MIT | Apache 2.0 | MIT | MIT |
The clearest competitor on the in-app copilot use case is Stagehand, which targets AI-assisted actions within existing automation workflows—but it still operates from outside the browser, requiring Playwright underneath. Page-agent is the only production-grade option that runs purely client-side.
Installation and Integration
```shell
npm install page-agent
```

The library exposes a bookmarklet for quick experimentation, a Chrome extension for multi-tab operation, and the npm package for production integration.4
A minimal integration for adding a command panel to an existing app:
```typescript
import { PageAgent } from 'page-agent';

// Initialize with any OpenAI-compatible endpoint
const agent = new PageAgent({
  apiKey: process.env.OPENAI_API_KEY,
  model: 'gpt-4o',
  // Optional: restrict what the agent can do
  allowList: ['click', 'fill', 'scroll'],
  // Optional: mask sensitive fields before LLM processing
  dataMask: ['input[type="password"]', '.credit-card'],
});

// Execute a multi-step workflow via natural language
await agent.execute(
  'Export all overdue invoices from the last 30 days to CSV'
);
```

The `allowList` and `dataMask` options address the two most immediate concerns in production deployments: scope restriction and data privacy.
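The idea behind `dataMask` can be illustrated independently of the library: before the serialized DOM reaches the LLM, fields matching the configured selectors are replaced with a redaction token. A generic sketch, not page-agent's actual code:

```typescript
// Generic illustration of selector-based masking before LLM processing.
// Matching here is exact string comparison for simplicity; a real
// implementation would evaluate CSS selectors against live elements.

interface Field { selector: string; value: string }

function maskFields(fields: Field[], maskSelectors: string[]): Field[] {
  return fields.map((f) =>
    maskSelectors.includes(f.selector)
      ? { ...f, value: '[MASKED]' } // redact before serializing for the model
      : f
  );
}

const page: Field[] = [
  { selector: 'input[type="password"]', value: 'hunter2' },
  { selector: 'input[name="email"]', value: 'alice@example.com' },
];

const safe = maskFields(page, ['input[type="password"]']);
// safe[0].value is '[MASKED]'; the email field passes through unchanged
```

The key property is that masking happens at serialization time, so the sensitive value never appears in any prompt or model-side log.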
The Security Picture
Client-side AI agents that read and act on DOM content introduce a meaningful security surface. The most significant risk is indirect prompt injection: malicious content embedded in a webpage (in a comment, a form value, a dynamically loaded advertisement) that instructs the agent to take unintended actions.5
Because page-agent operates with the user’s authenticated session, a successful injection can reach any resource that user can access—email, file storage, financial systems. This is not a page-agent-specific problem; it applies to every agentic browser tool. But the in-page model concentrates the risk: there is no sandboxed subprocess, no origin boundary, no separate security context.
The free demo version routes data through servers in mainland China, per independent testing.6 Production deployments should use enterprise LLM endpoints where data residency matters.
Page-agent’s built-in human-in-the-loop UI—a thinking panel that surfaces the agent’s reasoning before each action—is the primary mitigation on offer today. For high-stakes workflows, requiring explicit human confirmation at each step substantially reduces the blast radius of a successful injection.
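A human-in-the-loop gate can be expressed generically: each proposed action is held until a reviewer approves it, which is the pattern page-agent's thinking panel implements in its UI. A sketch with hypothetical names, not part of the page-agent API:

```typescript
// Generic confirmation gate: an action executes only after explicit approval.
// `confirm` would be wired to a UI prompt; `run` applies the DOM operation.

type ProposedAction = { tool: string; target: string };

async function gated(
  action: ProposedAction,
  confirm: (a: ProposedAction) => Promise<boolean>,
  run: (a: ProposedAction) => Promise<void>
): Promise<boolean> {
  if (!(await confirm(action))) {
    return false; // reviewer rejected: the action is dropped, not queued
  }
  await run(action);
  return true;
}
```

Because a prompt injection can only propose actions, not approve them, the human approval step bounds what a successful injection can actually execute.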
The Real Implication: Every Web App Gets an AI Layer
The deeper significance of page-agent is not the technology itself—DOM manipulation via LLMs is a known pattern—but the deployment model. Previous approaches to AI-powered web automation required infrastructure: a Python server, a headless browser farm, credential management, session synchronization. That infrastructure cost made the pattern viable only for well-resourced teams building bespoke automation.
Page-agent’s in-page, npm-installable model changes the economics. Any SaaS vendor can add a natural language command interface to their product in an afternoon. Any enterprise IT team can wrap an aging internal tool with conversational control without touching the backend.
That is the missing layer this tool provides: a thin, LLM-powered translation layer between human intent and the existing web surface—without rebuilding the surface itself.
Whether the broader ecosystem converges on this in-page model or continues to favor external automation frameworks will likely be determined by the security picture as much as the developer experience. A well-publicized prompt injection incident involving an in-page agent with production access could reset adoption curves quickly.
At 8,600+ stars in a matter of weeks, the appetite is clearly there.
Frequently Asked Questions
Q: Does page-agent require a specific LLM provider?
A: No. It supports any OpenAI-compatible endpoint, including Alibaba Qwen, Claude, DeepSeek, Gemini, and Ollama for fully local deployments. You bring your own API key.
Q: How is page-agent different from Playwright or Puppeteer?
A: Playwright and Puppeteer automate a separate browser instance from outside, requiring credential management and external infrastructure. Page-agent embeds in the running page, inheriting the user’s authenticated session—making it suitable for in-app copilot use cases rather than CI/CD test automation.
Q: Is page-agent safe to use with sensitive data?
A: With caveats. The dataMask configuration can prevent sensitive fields (passwords, card numbers) from reaching the LLM. The allowList configuration restricts what actions the agent can take. The free demo routes data through Alibaba’s servers in China; production use should configure a private LLM endpoint. Prompt injection is a genuine risk on any page that renders untrusted content.
Q: Can page-agent operate across multiple browser tabs?
A: The base npm package operates on a single page. Multi-tab coordination requires the optional Chrome extension (@page-agent/extension).
Q: What interfaces can’t page-agent handle?
A: It cannot solve CAPTCHAs, interpret content that exists only as images, use keyboard shortcuts, right-click, or reliably type into certain content-editable elements (notably Twitter’s post composer). For visually-dependent workflows, screenshot-based agents like Browser-Use or Stagehand are better alternatives.
Sources:
- alibaba/page-agent — GitHub
- What is PageAgent — DeepWiki
- PageAgent.js Official Site
- PageAgent: Alibaba’s Answer to Controlling Any Web App With Plain English — TopAIProduct
- One Line of Code, Total Web Control — HumanaAI Substack
- I tried using PageAgent — GIGAZINE
- Page-Agent: Alibaba’s Open Source AI Web Copilot — Emelia.io
- Fooling AI Agents: Web-Based Indirect Prompt Injection — Palo Alto Unit 42
- Stagehand vs Browser Use vs Playwright: AI Browser Automation Compared — NxCode
- Page Agent — EveryDev.ai
- Beyond Pixels: DOM Downsampling for LLM-Based Web Agents — arXiv
- Mitigating Prompt Injections in Browser Use — Anthropic
Footnotes
1. GitHub. “alibaba/page-agent.” https://github.com/alibaba/page-agent
2. DeepWiki. “What is PageAgent.” https://deepwiki.com/alibaba/page-agent/1.1-what-is-pageagent
3. PageAgent.js Official Documentation. “AI-powered GUI Agent.” https://alibaba.github.io/page-agent/
4. Gigazine. “I tried using ‘PageAgent,’ which allows you to easily perform various tasks on web pages using AI.” March 6, 2026. https://gigazine.net/gsc_news/en/20260306-pageagent-ai-web-control-interfaces/
5. Palo Alto Networks Unit 42. “Fooling AI Agents: Web-Based Indirect Prompt Injection Observed in the Wild.” https://unit42.paloaltonetworks.com/ai-agent-prompt-injection/
6. Gigazine. “I tried using ‘PageAgent.’” March 6, 2026. https://gigazine.net/gsc_news/en/20260306-pageagent-ai-web-control-interfaces/