Alibaba’s page-agent is a JavaScript library that embeds an AI agent directly into any web page, enabling natural language control of the DOM—no browser extensions, Python scripts, or headless Chrome instances required. With 8,600+ GitHub stars accumulated since its public release, it represents a meaningfully different approach to web automation: the missing middleware layer between LLMs and the existing web.
What Is Page-Agent?
Page-agent is an open-source TypeScript library published by Alibaba that turns any web interface into a surface an LLM can operate. The project, hosted at alibaba/page-agent on GitHub, is MIT-licensed and available via npm.
Unlike headless browser frameworks such as Playwright or Puppeteer—which automate a separate browser instance from outside—page-agent embeds directly into the running page via a <script> tag or npm import. The agent lives inside the user’s browser session. It sees the DOM the user sees. It acts with the permissions the user already has.
The practical consequence: a developer can ship an AI copilot into an existing web application with roughly a dozen lines of code and no backend rewrite.
```typescript
import { PageAgent } from 'page-agent';

const agent = new PageAgent({
  model: 'qwen3.5-plus',
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
});

await agent.execute('Find the highest-priority open ticket and assign it to Alice');
```

As of March 2026, the project is on v1.5.7 across 22 releases, with 675 forks and active development.1
How Does Page-Agent Work?
The architecture implements a tight Observe–Think–Act loop across three stages:
- Observe: The `PageController` component extracts the current DOM state, converting the page into a simplified HTML representation with indexed interactive elements stripped of visual noise.
- Think: This text representation, combined with task history, is passed to the configured LLM. The model reasons about what action to take next.
- Act: Selected tools execute synthetic DOM operations against the live page: clicks, form fills, scrolls, navigation.
Each step issues a fresh LLM call with updated page state, making the system reactive to dynamic changes like modals, loading spinners, and paginated tables.
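The loop can be sketched in a few lines of TypeScript. This is illustrative pseudo-architecture under stated assumptions, not page-agent's actual implementation; the names `observe`, `step`, and `think` are hypothetical.

```typescript
// A minimal sketch of the Observe-Think-Act loop described above.
// All names here are hypothetical, not part of the page-agent API.

type Action = { tool: 'click' | 'fill' | 'done'; index?: number; value?: string };

// Observe: reduce the page's interactive elements to an indexed text form.
function observe(elements: string[]): string {
  return elements.map((el, i) => `[${i}] ${el}`).join('\n');
}

// One iteration: serialize fresh page state, ask the model, record the action.
// Because state is re-observed every step, dynamic changes (modals, spinners)
// are visible to the next LLM call.
async function step(
  elements: string[],
  history: string[],
  think: (state: string, history: string[]) => Promise<Action>
): Promise<Action> {
  const state = observe(elements);
  const action = await think(state, history);
  history.push(JSON.stringify(action));
  return action;
}
```

The caller applies the returned action to the live DOM and re-enters the loop until the model emits a terminal action.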
The project is a TypeScript monorepo with seven packages: page-agent (user-facing), @page-agent/core, @page-agent/llms, @page-agent/page-controller, @page-agent/ui, @page-agent/extension (for multi-tab coordination), and @page-agent/website.2
Notable: the project’s documentation acknowledges that its DOM processing components derive from the browser-use project (Copyright 2024 Gregor Zunic, MIT Licensed).
Why Does the In-Page Architecture Matter?
Every other major web automation approach runs outside the browser:
- Playwright/Puppeteer: Automate a separate browser instance via WebDriver; require Node.js infrastructure and explicit credential management
- Browser-use: Python-based agent framework running in a headless Chrome process
- OpenAI’s Computer Use / Anthropic’s Computer Use: Screenshot-based visual agents operating from the OS level
Page-agent’s in-page position eliminates an entire category of operational friction. The agent inherits the authenticated session already open in the user’s browser. There is no separate credential store, no cookie synchronization problem, no TLS interception layer to maintain.
For developers shipping internal tooling—think ERP dashboards, customer support interfaces, data entry workflows—this matters enormously. Adding a natural-language command layer to an existing SaaS product becomes a pure frontend problem.
Provider-Agnostic LLM Support
Page-agent ships with no LLM lock-in. The LLMConfig interface accepts a custom baseURL and apiKey, meaning any OpenAI-compatible API endpoint works. The project ships model-specific patches for API variations across major providers.3
Supported as of March 2026:
| Provider | Integration Method |
|---|---|
| OpenAI (GPT-4o, o3) | Native via OpenAI-compatible API |
| Alibaba Qwen | Dashscope compatible endpoint |
| Anthropic Claude | Custom API patch |
| DeepSeek | OpenAI-compatible endpoint |
| Google Gemini | Compatible mode |
| Ollama (local) | Local endpoint, no API key |
The Ollama support is particularly significant: it makes offline deployment feasible for enterprises with data sovereignty requirements or air-gapped environments.
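Under that convention, a local deployment would point `baseURL` at Ollama's OpenAI-compatible endpoint. A sketch of the configuration values (the model tag is whatever has been pulled locally; Ollama serves a `/v1` API on port 11434 by default and ignores the API key):

```typescript
// Hypothetical LLMConfig values for a fully local, air-gapped setup.
const ollamaConfig = {
  model: 'qwen2.5:14b',                 // any locally pulled model tag
  baseURL: 'http://localhost:11434/v1', // Ollama's OpenAI-compatible endpoint
  apiKey: 'ollama',                     // placeholder; Ollama ignores the value
};
```

No traffic leaves the machine: the same OpenAI-compatible request shape is served entirely by the local Ollama process.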
Comparison: Page-Agent vs. Competing Approaches
| | Page-Agent | Playwright | Browser-Use | Stagehand |
|---|---|---|---|---|
| Deployment | In-page JS | External Node.js | External Python | External Node.js |
| Session auth | Inherited from browser | Manual credential mgmt | Manual credential mgmt | Manual credential mgmt |
| Interface method | DOM text extraction | WebDriver API | DOM + screenshot | DOM + screenshot |
| Vision required | No | No | Optional | Optional |
| Multi-tab | Extension required | Native | Native | Native |
| Best for | In-app copilots | CI/CD test automation | Autonomous research agents | Surgical AI actions |
| GitHub stars (Mar 2026) | ~8.6k | ~67k | ~21k | ~8k |
| License | MIT | Apache 2.0 | MIT | MIT |
The clearest competitor on the in-app copilot use case is Stagehand, which targets AI-assisted actions within existing automation workflows—but it still operates from outside the browser, requiring Playwright underneath. Page-agent is the only production-grade option that runs purely client-side.
Installation and Integration
```shell
npm install page-agent
```

The library exposes a bookmarklet for quick experimentation, a Chrome extension for multi-tab operation, and the npm package for production integration.4
A minimal integration for adding a command panel to an existing app:
```typescript
import { PageAgent } from 'page-agent';

// Initialize with any OpenAI-compatible endpoint
const agent = new PageAgent({
  apiKey: process.env.OPENAI_API_KEY,
  model: 'gpt-4o',
  // Optional: restrict what the agent can do
  allowList: ['click', 'fill', 'scroll'],
  // Optional: mask sensitive fields before LLM processing
  dataMask: ['input[type="password"]', '.credit-card'],
});

// Execute a multi-step workflow via natural language
await agent.execute(
  'Export all overdue invoices from the last 30 days to CSV'
);
```

The `allowList` and `dataMask` options address the two most immediate concerns in production deployments: scope restriction and data privacy.
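The idea behind `dataMask` can be illustrated independently of the library: before the serialized DOM reaches the LLM, fields matching the configured selectors are replaced with a redaction token. A generic sketch, not page-agent's actual code:

```typescript
// Generic illustration of selector-based masking before LLM processing.
// Matching here is exact string comparison for simplicity; a real
// implementation would evaluate CSS selectors against live elements.

interface Field { selector: string; value: string }

function maskFields(fields: Field[], maskSelectors: string[]): Field[] {
  return fields.map((f) =>
    maskSelectors.includes(f.selector)
      ? { ...f, value: '[MASKED]' } // redact before serializing for the model
      : f
  );
}

const page: Field[] = [
  { selector: 'input[type="password"]', value: 'hunter2' },
  { selector: 'input[name="email"]', value: 'alice@example.com' },
];

const safe = maskFields(page, ['input[type="password"]']);
// safe[0].value is '[MASKED]'; the email field passes through unchanged
```

The key property is that masking happens at serialization time, so the sensitive value never appears in any prompt or model-side log.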
The Security Picture
Client-side AI agents that read and act on DOM content introduce a meaningful security surface. The most significant risk is indirect prompt injection: malicious content embedded in a webpage (in a comment, a form value, a dynamically loaded advertisement) that instructs the agent to take unintended actions.5
Because page-agent operates with the user’s authenticated session, a successful injection can reach any resource that user can access—email, file storage, financial systems. This is not a page-agent-specific problem; it applies to every agentic browser tool. But the in-page model concentrates the risk: there is no sandboxed subprocess, no origin boundary, no separate security context.
The free demo version routes data through servers in mainland China, per independent testing.6 Production deployments should use enterprise LLM endpoints where data residency matters.
Page-agent’s built-in human-in-the-loop UI—a thinking panel that surfaces the agent’s reasoning before each action—is the primary mitigation on offer today. For high-stakes workflows, requiring explicit human confirmation at each step substantially reduces the blast radius of a successful injection.
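A human-in-the-loop gate can be expressed generically: each proposed action is held until a reviewer approves it, which is the pattern page-agent's thinking panel implements in its UI. A sketch with hypothetical names, not part of the page-agent API:

```typescript
// Generic confirmation gate: an action executes only after explicit approval.
// `confirm` would be wired to a UI prompt; `run` applies the DOM operation.

type ProposedAction = { tool: string; target: string };

async function gated(
  action: ProposedAction,
  confirm: (a: ProposedAction) => Promise<boolean>,
  run: (a: ProposedAction) => Promise<void>
): Promise<boolean> {
  if (!(await confirm(action))) {
    return false; // reviewer rejected: the action is dropped, not queued
  }
  await run(action);
  return true;
}
```

Because a prompt injection can only propose actions, not approve them, the human approval step bounds what a successful injection can actually execute.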
The Real Implication: Every Web App Gets an AI Layer
The deeper significance of page-agent is not the technology itself—DOM manipulation via LLMs is a known pattern—but the deployment model. Previous approaches to AI-powered web automation required infrastructure: a Python server, a headless browser farm, credential management, session synchronization. That infrastructure cost made the pattern viable only for well-resourced teams building bespoke automation.
Page-agent’s in-page, npm-installable model changes the economics. Any SaaS vendor can add a natural language command interface to their product in an afternoon. Any enterprise IT team can wrap an aging internal tool with conversational control without touching the backend.
That is the missing layer this tool provides: a thin, LLM-powered translation layer between human intent and the existing web surface—without rebuilding the surface itself.
Whether the broader ecosystem converges on this in-page model or continues to favor external automation frameworks will likely be determined by the security picture as much as the developer experience. A well-publicized prompt injection incident involving an in-page agent with production access could reset adoption curves quickly.
At 8,600+ stars in a matter of weeks, the appetite is clearly there.
Frequently Asked Questions
Q: Does page-agent require a specific LLM provider?
A: No. It supports any OpenAI-compatible endpoint, including Alibaba Qwen, Claude, DeepSeek, Gemini, and Ollama for fully local deployments. You bring your own API key.
Q: How is page-agent different from Playwright or Puppeteer?
A: Playwright and Puppeteer automate a separate browser instance from outside, requiring credential management and external infrastructure. Page-agent embeds in the running page, inheriting the user’s authenticated session—making it suitable for in-app copilot use cases rather than CI/CD test automation.
Q: Is page-agent safe to use with sensitive data?
A: With caveats. The dataMask configuration can prevent sensitive fields (passwords, card numbers) from reaching the LLM. The allowList configuration restricts what actions the agent can take. The free demo routes data through Alibaba’s servers in China; production use should configure a private LLM endpoint. Prompt injection is a genuine risk on any page that renders untrusted content.
Q: Can page-agent operate across multiple browser tabs?
A: The base npm package operates on a single page. Multi-tab coordination requires the optional Chrome extension (@page-agent/extension).
Q: What interfaces can’t page-agent handle?
A: It cannot solve CAPTCHAs, interpret content that exists only as images, use keyboard shortcuts, right-click, or reliably type into certain content-editable elements (notably Twitter’s post composer). For visually-dependent workflows, screenshot-based agents like Browser-Use or Stagehand are better alternatives.
Sources:
- alibaba/page-agent — GitHub
- What is PageAgent — DeepWiki
- PageAgent.js Official Site
- PageAgent: Alibaba’s Answer to Controlling Any Web App With Plain English — TopAIProduct
- One Line of Code, Total Web Control — HumanaAI Substack
- I tried using PageAgent — GIGAZINE
- Page-Agent: Alibaba’s Open Source AI Web Copilot — Emelia.io
- Fooling AI Agents: Web-Based Indirect Prompt Injection — Palo Alto Unit 42
- Stagehand vs Browser Use vs Playwright: AI Browser Automation Compared — NxCode
- Page Agent — EveryDev.ai
- Beyond Pixels: DOM Downsampling for LLM-Based Web Agents — arXiv
- Mitigating Prompt Injections in Browser Use — Anthropic
Footnotes
1. GitHub. “alibaba/page-agent.” https://github.com/alibaba/page-agent
2. DeepWiki. “What is PageAgent.” https://deepwiki.com/alibaba/page-agent/1.1-what-is-pageagent
3. PageAgent.js Official Documentation. “AI-powered GUI Agent.” https://alibaba.github.io/page-agent/
4. Gigazine. “I tried using ‘PageAgent,’ which allows you to easily perform various tasks on web pages using AI.” March 6, 2026. https://gigazine.net/gsc_news/en/20260306-pageagent-ai-web-control-interfaces/
5. Palo Alto Networks Unit 42. “Fooling AI Agents: Web-Based Indirect Prompt Injection Observed in the Wild.” https://unit42.paloaltonetworks.com/ai-agent-prompt-injection/
6. Gigazine. “I tried using ‘PageAgent.’” March 6, 2026. https://gigazine.net/gsc_news/en/20260306-pageagent-ai-web-control-interfaces/