Playwright + AI: MCP vs CLI vs Agents - Token Cost Optimization Guide
Browser automation with AI is no longer experimental - it is a daily workflow for testing, scraping, and web interaction. But the three main approaches (Playwright MCP server, Playwright CLI tools, and full agent-based browser control) have dramatically different token usage profiles. A task that costs $0.05 via a CLI script can cost $2.00 or more via an agent with screenshots. This guide provides real token measurements, cost comparisons, and optimization techniques for each approach.
1. The Three Approaches Explained
Playwright MCP Server
The Playwright MCP server exposes browser automation as tools that an AI agent can call. The agent decides when to navigate, click, type, or take snapshots. The MCP server handles the browser instance and returns structured results (DOM snapshots, accessibility trees, or screenshots).
Key characteristic: the AI model is in the loop for every decision. Each action requires a model inference call to decide the next step. This provides flexibility but costs tokens for every interaction.
Playwright CLI / Script
The traditional approach: write a Playwright script (JavaScript/Python/etc.) that defines the exact sequence of actions. The AI agent writes the script once, then the script runs without any model calls. Deterministic, fast, and token-efficient.
Key characteristic: the AI model is only involved in writing the script. Execution is model-free. This is the cheapest approach but requires knowing the exact steps in advance.
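The script approach can be sketched as follows. This is a minimal example of the task measured in section 2, assuming a hypothetical app URL and selectors; the model writes it once, and every subsequent run is model-free.

```python
def run_smoke_test(base_url: str = "https://app.example.com") -> None:
    """Log in, create an item, verify it appears, log out -- zero tokens per run."""
    # Deferred import: the script only needs Playwright at execution time.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(f"{base_url}/login")
        page.fill("#username", "demo")           # placeholder credentials
        page.fill("#password", "secret")
        page.click("button[type=submit]")
        page.click("text=Create New")
        page.fill("#item-name", "Widget")
        page.click("text=Save")
        page.wait_for_selector("text=Widget")    # verify the new item appears
        page.click("text=Log out")
        browser.close()
```

Once saved, this runs in CI or on a schedule with no model calls at all, which is where the near-zero marginal cost in section 2 comes from.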
Full Agent Browser Control
Tools like Anthropic's Computer Use or browser agent frameworks that give the AI full control over a browser through screenshots. The agent sees the screen, decides where to click, observes the result, and continues. This is the most flexible approach and the most expensive.
Key characteristic: every step involves sending a screenshot (hundreds of KB to MB of image data) to the model, plus the model's reasoning about what to do next. Token usage scales with the number of steps and screen complexity.
2. Token Usage: Real Measurements
Here are measured token costs for a standard test task: navigating to a web app, logging in, creating a new item, verifying it appears in a list, and logging out. Approximately 8-12 browser interactions.
| Approach | Input Tokens | Output Tokens | Approx Cost (Sonnet) | Time |
|---|---|---|---|---|
| CLI script (write + run) | ~8,000 | ~2,000 | $0.03-0.05 | 15-30s |
| MCP (snapshot mode) | ~40,000 | ~5,000 | $0.15-0.25 | 30-90s |
| MCP (screenshot mode) | ~200,000 | ~8,000 | $0.70-1.20 | 60-180s |
| Full agent (Computer Use) | ~400,000 | ~15,000 | $1.50-3.00 | 120-300s |
The difference is dramatic. CLI scripting is 30-100x cheaper than full agent control. MCP with snapshots sits in the middle - more flexible than scripts, much cheaper than screenshots.
The token cost difference comes primarily from input size. A DOM snapshot is typically 2,000-5,000 tokens. A screenshot encoded as base64 is 15,000-50,000 tokens depending on resolution. An accessibility tree is typically 1,000-3,000 tokens. These costs multiply by the number of steps.
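The multiplication is easy to model. The sketch below uses illustrative payload sizes from the paragraph above and assumed Sonnet-class pricing ($3/$15 per million input/output tokens); the numbers are a back-of-envelope model, not measurements.

```python
SONNET_INPUT_PER_MTOK = 3.00    # assumed $ per 1M input tokens
SONNET_OUTPUT_PER_MTOK = 15.00  # assumed $ per 1M output tokens

PAYLOAD_TOKENS = {              # rough per-step input payload sizes
    "accessibility_tree": 2_000,
    "dom_snapshot": 3_500,
    "screenshot": 30_000,
}

def run_cost(payload: str, steps: int, output_tokens_per_step: int = 500) -> float:
    """Estimate USD cost of a multi-step browser run: payload size x step count."""
    input_tokens = PAYLOAD_TOKENS[payload] * steps
    output_tokens = output_tokens_per_step * steps
    return (input_tokens * SONNET_INPUT_PER_MTOK
            + output_tokens * SONNET_OUTPUT_PER_MTOK) / 1_000_000

# A 10-step task: snapshots stay in the cents range, screenshots do not.
snapshot_cost = run_cost("dom_snapshot", steps=10)    # ~$0.18
screenshot_cost = run_cost("screenshot", steps=10)    # ~$0.98
```

These estimates land inside the measured ranges in the table above, which suggests the per-step payload really is the dominant cost term.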
3. When to Use Each Approach
| Scenario | Best Approach | Why |
|---|---|---|
| Repeated test suite | CLI script | Write once, run many times at zero token cost |
| One-off data extraction | MCP (snapshot) | Flexible, no script maintenance needed |
| Visual QA checking | MCP (screenshot) or agent | Need visual verification of rendered output |
| Dynamic/unknown UI | Full agent | Agent adapts to unexpected layouts |
| Form filling | MCP (snapshot) | DOM structure sufficient, no visual needed |
| Cross-app workflow | Desktop agent | Browser + native apps need unified control |
The general rule: use the cheapest approach that works for your use case. Default to CLI scripts for anything repeatable. Use MCP snapshots for ad-hoc tasks. Reserve screenshots and full agent control for visual tasks and dynamic UIs.
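The routing rule in the table can be encoded as a tiny dispatcher. The task attributes and approach names below are illustrative, not a standard API.

```python
def pick_approach(repeatable: bool, needs_visual: bool, unknown_ui: bool) -> str:
    """Return the cheapest viable approach for a browser task."""
    if unknown_ui:
        return "full_agent"       # agent adapts to unexpected layouts
    if needs_visual:
        return "mcp_screenshot"   # rendered output must be verified
    if repeatable:
        return "cli_script"       # write once, run at zero token cost
    return "mcp_snapshot"         # flexible default for ad-hoc tasks
```

The ordering matters: checks run from most to least expensive requirement, so a task only escalates when a cheaper approach genuinely cannot handle it.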
4. Optimizing Playwright MCP Token Usage
If you are using the Playwright MCP server, several techniques can reduce token consumption significantly:
- Use snapshots instead of screenshots - The Playwright MCP browser_snapshot tool returns an accessibility tree representation of the page. This is 5-20x smaller than a screenshot while providing all the information needed for most interactions.
- Minimize snapshot frequency - You do not need a snapshot after every click. Navigate, perform multiple actions, then snapshot to verify the result.
- Use element references - After a snapshot, use the [ref=eN] element references for subsequent clicks and typing instead of re-snapshotting to find elements.
- Batch actions - Use browser_fill_form for multiple form fields at once instead of typing into each field individually.
- Targeted reads - If you only need to check a specific part of the page, use JavaScript evaluation (browser_evaluate) to extract just the data you need rather than snapshotting the entire page.
Teams applying these optimizations typically see 40-60% reduction in token usage compared to naive MCP usage.
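The snapshot-frequency item alone accounts for much of that reduction. A rough model, with assumed token sizes for a snapshot and a plain tool call:

```python
SNAPSHOT_TOKENS = 3_000   # assumed size of one accessibility-tree snapshot
ACTION_TOKENS = 100       # assumed size of a click/type tool call and its result

def run_tokens(actions: int, snapshots: int) -> int:
    """Total input tokens for a run with the given action and snapshot counts."""
    return actions * ACTION_TOKENS + snapshots * SNAPSHOT_TOKENS

naive = run_tokens(actions=10, snapshots=10)     # snapshot after every action
optimized = run_tokens(actions=10, snapshots=4)  # snapshot only at checkpoints
savings = 1 - optimized / naive                  # ~58% fewer tokens
```

Dropping from ten snapshots to four checkpoint snapshots cuts token usage by roughly 58% in this model, consistent with the 40-60% range observed in practice.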
5. Snapshots vs Screenshots: The Cost Difference
This distinction deserves its own section because it is the single biggest cost lever in AI browser automation.
A snapshot is a structured text representation of the page. The Playwright MCP server generates this from the accessibility tree, producing output like:
- heading "Dashboard" [ref=e1]
- navigation "Main menu" [ref=e2]
- link "Home" [ref=e3]
- link "Settings" [ref=e4]
- main [ref=e5]
- table "Recent items" [ref=e6]
- row "Item 1 - Active" [ref=e7]
- row "Item 2 - Pending" [ref=e8]
- button "Create New" [ref=e9]This is typically 1,000-5,000 tokens. It contains all the information needed to navigate, click, and type.
A screenshot is a base64-encoded PNG image. Even at reduced resolution (1280x720), this is typically 15,000-50,000 tokens. At full resolution, it can be 80,000+ tokens.
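The screenshot numbers follow from two ratios: base64 inflates bytes by 4/3, and tokenizers average roughly four characters per token. Both ratios are approximations, and this models an image embedded as raw base64 text; providers with native image inputs use different accounting.

```python
import math

def base64_image_tokens(png_bytes: int, chars_per_token: float = 4.0) -> int:
    """Approximate token cost of a PNG embedded as base64 text."""
    b64_chars = math.ceil(png_bytes / 3) * 4          # base64 expansion: 3 bytes -> 4 chars
    return math.ceil(b64_chars / chars_per_token)

# A modest 150 KB screenshot is already ~50k tokens as raw base64 text,
# versus a few thousand for the equivalent accessibility-tree snapshot.
tokens = base64_image_tokens(150_000)
```

A 45 KB image comes out near 15,000 tokens and a 150 KB image near 50,000, matching the range quoted above.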
When you need screenshots:
- Visual regression testing (comparing rendered appearance)
- Canvas or WebGL content that has no DOM representation
- Image verification (checking that the right image loaded)
- CSS layout debugging
For everything else - navigation, form filling, data extraction, link clicking - snapshots are sufficient and dramatically cheaper.
6. The Accessibility API Alternative
Beyond Playwright's browser-specific approach, there is a fundamentally different method for UI automation: using the operating system's accessibility APIs directly.
On macOS, the accessibility framework exposes the entire UI tree of every application - including browsers, but also native apps, Electron apps, and anything else running on the system. This tree contains element roles, labels, values, positions, and states.
Advantages over Playwright for browser automation:
- Works across all applications - Not limited to browser tabs. An accessibility-based agent can interact with the browser AND native apps in the same workflow.
- Resolution independent - The accessibility tree does not change with screen resolution, Retina scaling, or window size. Playwright screenshots need to account for these.
- Consistent token cost - The accessibility tree size depends on UI complexity, not pixel count. A complex web app might produce a 3,000 token tree regardless of whether the screen is 1080p or 4K.
- Faster execution - No rendering or screenshot encoding overhead. Actions are dispatched directly through the accessibility API.
Desktop agents like Fazm use accessibility APIs as their primary interaction method. This makes them particularly token-efficient for UI automation tasks compared to screenshot-based approaches like Computer Use. The tradeoff is that accessibility APIs require OS-level permissions and are platform-specific (macOS only for Fazm, though similar APIs exist on Windows and Linux).
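The "consistent token cost" point can be illustrated with a toy serializer: flattening an accessibility-style tree into text depends only on node count, never on pixel dimensions. The tree shape below mirrors the snapshot format in section 5; the node data is made up.

```python
def flatten(node: dict, depth: int = 0) -> list[str]:
    """Serialize an accessibility-like tree into indented text lines."""
    line = f'{"  " * depth}- {node["role"]} "{node.get("name", "")}"'
    lines = [line]
    for child in node.get("children", []):
        lines.extend(flatten(child, depth + 1))
    return lines

ui = {
    "role": "window", "name": "Dashboard",
    "children": [
        {"role": "button", "name": "Create New"},
        {"role": "table", "name": "Recent items",
         "children": [{"role": "row", "name": "Item 1 - Active"}]},
    ],
}

snapshot = "\n".join(flatten(ui))  # identical output at 1080p or 4K
```

Rendering this same window at a higher resolution changes nothing in the output, whereas a screenshot's token cost would roughly quadruple from 1080p to 4K.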
7. Setting Up Cost Budgets for Browser Automation
To prevent cost surprises, set up explicit budgets for browser automation workflows:
- Per-task limits - Cap the maximum number of browser interactions per task. For most web automation tasks, 20-30 interactions is sufficient. If the agent exceeds this, it is likely stuck.
- Token tracking - Log token usage per automation run. Build a baseline for your common tasks and alert when usage exceeds 2x the baseline.
- Approach routing - Automatically route tasks to the cheapest viable approach. If a task can be expressed as a CLI script, generate the script instead of using MCP. Use MCP snapshots unless the task specifically requires visual verification.
- Model selection - Use Haiku or Sonnet for simple navigation tasks. Reserve Opus for complex reasoning about page content. This alone can reduce costs 3-5x.
- Caching - Cache page snapshots when navigating known paths. If you know the login page layout, there is no need to snapshot it every time.
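The first two items above (per-task caps and baseline alerts) can be combined into one small guard. This is a minimal sketch; the threshold values are illustrative.

```python
class AutomationBudget:
    """Track interactions and tokens for one automation run."""

    def __init__(self, max_interactions: int = 30, baseline_tokens: int = 50_000):
        self.max_interactions = max_interactions
        self.baseline_tokens = baseline_tokens
        self.interactions = 0
        self.tokens = 0

    def record(self, tokens: int) -> None:
        """Record one browser interaction; abort if the cap is exceeded."""
        self.interactions += 1
        self.tokens += tokens
        if self.interactions > self.max_interactions:
            raise RuntimeError("interaction cap exceeded -- agent is likely stuck")

    @property
    def over_baseline(self) -> bool:
        """Alert condition: usage beyond 2x the established baseline."""
        return self.tokens > 2 * self.baseline_tokens
```

Calling record() after every tool call gives a hard stop for runaway agents, while over_baseline feeds the alerting described above.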
A reasonable budget for a team doing daily browser automation:
- CI/CD test automation (CLI scripts): $1-5/month
- Ad-hoc MCP-based automation: $20-50/month per developer
- Full agent browser control: $50-200/month per developer
- Desktop agent workflows: $30-80/month per developer
Token-Efficient Desktop Automation
Fazm uses accessibility APIs instead of screenshots, keeping token costs low while automating browser and native macOS applications.