Reference, current as of May 2026
Linux desktop instance API for LLM tool use
If you searched for this you are almost certainly building an agent loop and you want a Linux box the model can drive through tool calls. This page is the short, opinionated map of the four real options as of May 2026, what each one exposes, what they all share, and the one architectural seam none of them solve. The contrast at the end is with a native macOS agent (Fazm), not because it is a substitute, but because the difference shows you when an instance API is the wrong shape for the work.
Direct answer (verified May 7, 2026)
Four canonical Linux desktop instance APIs an LLM can drive for tool use:
- E2B Desktop Sandbox, open source, Linux with Xfce, Python and TypeScript SDKs. github.com/e2b-dev/desktop.
- Scrapybara, hosted (YC), Ubuntu, Browser, and Windows instances, free tier and paid plans from $29/mo. scrapybara.com.
- Anthropic computer-use-demo, reference Docker container at XGA 1024x768, VNC on port 5900, Streamlit UI on 8501. github.com/anthropics/anthropic-quickstarts.
- Open Computer Use, open source agent loop built on E2B Desktop, designed to work with open source LLMs. github.com/e2b-dev/open-computer-use.
Sources cross-checked at the time of writing: github.com/e2b-dev/desktop, scrapybara.com, anthropic-quickstarts/computer-use-demo.
The shape they all share
Strip the marketing off any Linux desktop instance API and the contract is the same. The agent loop holds a model, the model emits a tool call (move mouse, click, type, screenshot, run command), the API turns it into an action against a virtual machine running a desktop environment, and a result comes back. That is the whole shape. The interesting variation is in billing, in what model the tool call schema is wired to, and in what the cold-start time looks like.
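Stripped to code, that shared contract is about a dozen lines. The sketch below is vendor-neutral by design: `sandbox`, `call_model`, and the tool-call dict shape are hypothetical stand-ins for whichever SDK and model client you actually wire in, not any one provider's API.

```python
# Minimal sketch of the loop every instance API shares. The sandbox object
# and the model client are hypothetical; substitute your real SDK.

def run_agent_loop(sandbox, call_model, task, max_turns=20):
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        screenshot = sandbox.screenshot()         # bytes of the current frame
        action = call_model(history, screenshot)  # model emits one tool call
        if action["name"] == "done":
            return action["args"].get("result")
        # Forward the tool call as a method call on the sandbox object.
        getattr(sandbox, action["name"])(**action["args"])
        history.append({"role": "assistant", "content": action})
    return None
```

Everything a given vendor adds (auth, billing, a structured tool schema) wraps around this loop; nothing replaces it.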
[Diagram: one instance API session — the agent loop holds the model, the model's tool call goes through the API, and one of the four backing implementations executes it on a VM.]
The backing implementations are where the four differ, and that is all. None of the four ships a different mental model; they ship the same loop with different cost surfaces. The architectural decision a team is actually making here is sandbox vs no sandbox, not E2B vs Scrapybara.
The four options, with the trade I would actually make
One pick for each common shape of work, Linux desktop instance APIs as of May 2026:
- E2B Desktop: open source, Linux + Xfce, Python and TS SDKs. Pick when you want to read the source and run it self-hosted.
- Scrapybara: hosted, multiple OS images, structured computer tool. Pick when you want to skip ops and pay per hour.
- Anthropic ref: Docker container with VNC and Streamlit. Pick when you want the loop Anthropic itself trains against.
- Open Computer Use: OSS agent loop on E2B Desktop, designed for open source LLMs. Pick when you cannot send screenshots to a vendor.
1. E2B Desktop Sandbox
github.com/e2b-dev/desktop, Linux with Xfce, MIT licensed
The most read-the-source-friendly option. The repo gives you a sandbox class with a small surface (screenshot, mouse_move, left_click, double_click, scroll, write, press, drag, commands.run, files), Python and TypeScript SDKs, and a customizable template if you do not like Xfce. It runs on E2B's hosted sandbox infrastructure by default; the value of being open is mostly that you can audit the wire format and the shell pipeline rather than that you will host it yourself.
2. Scrapybara
scrapybara.com, Ubuntu / Browser / Windows, hosted
The most production-shaped option in May 2026. Three image types (Ubuntu desktop, headless browser, full Windows), Python and TypeScript SDKs, a structured ComputerTool, BashTool, and EditTool, and a published price ladder: free (10 compute hours, 100 agent credits, 5 concurrent instances), Basic at $29/mo (100 hours, 500 credits, 25 instances), Pro at $99/mo (500 hours, 2,500 credits, 100 instances), Enterprise with self-host. Top-up credits cost $0.04 each. The Windows option is the one that is hard to find elsewhere; everything else is also available on E2B.
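The ladder reduces to a small lookup. A toy helper using only the caps quoted from the pricing page above; the pick-the-cheapest-covering-plan logic is an illustration, not Scrapybara's billing code, and it ignores agent credits and top-ups.

```python
# Scrapybara's published ladder as of May 2026: (name, $/mo, compute hours,
# concurrent instances). Numbers from the pricing page; selection logic is
# illustrative only.
PLANS = [
    ("Free", 0, 10, 5),
    ("Basic", 29, 100, 25),
    ("Pro", 99, 500, 100),
]

def cheapest_plan(hours_per_month, concurrent):
    """Return the cheapest plan whose caps cover the workload."""
    for name, price, cap_hours, cap_concurrent in PLANS:
        if hours_per_month <= cap_hours and concurrent <= cap_concurrent:
            return name, price
    return "Enterprise", None  # self-host tier, custom pricing
```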
3. Anthropic reference computer-use-demo
github.com/anthropics/anthropic-quickstarts, single docker run
Useful primarily because it is the loop Anthropic itself trains the computer-use models against. One docker run gets you a Linux container with VNC on port 5900, a noVNC web client on 6080, the Streamlit agent UI on 8501, and a combined page on 8080. The README is explicit: do not exceed XGA (1024x768) for screenshots, because the model resizes everything down anyway and the click coordinates it returns are inferred from the resized image. Override with -e WIDTH=1920 -e HEIGHT=1080 if you must, then scale on the way in.
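The scale-then-map pattern the README recommends is pure arithmetic: shrink the capture to fit inside XGA before sending, let the model click on the scaled image, then divide its coordinates back out. A sketch, assuming you clamp to XGA and never upscale small screens:

```python
# Scale screenshots down to XGA for the model, and map the click
# coordinates it returns back to real screen pixels.
XGA = (1024, 768)

def scale_factor(width, height, target=XGA):
    # Shrink only; a screen already at or under XGA passes through untouched.
    return min(1.0, target[0] / width, target[1] / height)

def to_model_space(x, y, width, height):
    f = scale_factor(width, height)
    return round(x * f), round(y * f)

def to_screen_space(x, y, width, height):
    # A click the model returned on the scaled image, in real pixels.
    f = scale_factor(width, height)
    return round(x / f), round(y / f)
```

At 1920x1080 the factor is 1024/1920, so a model click at (512, 384) lands at (960, 720) on the real screen.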
4. Open Computer Use
github.com/e2b-dev/open-computer-use
The open source agent loop built on top of E2B Desktop, but wired to swap in non-Anthropic models. Useful when you cannot send screenshots of an internal desktop to a vendor API for policy reasons, and you have an open source vision model on your own infrastructure. The trade is real: open source vision models lag the proprietary ones at fine-grained click targeting, so the same task takes more turns and more failed clicks. Worth it for a regulated workload, not worth it as a default.
What an instance API call actually looks like
The shape is identical across the four. Below is the E2B Desktop flavor, which is the cleanest read on the wire format. Substitute imports and class names for the others; the verbs are the same.
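A hedged sketch of the forwarding step, written against the E2B method names listed earlier (screenshot, mouse_move, left_click, write, press); the tool-input shape follows Anthropic's computer tool. Treat the exact spellings as assumptions and check them against current docs rather than taking them as pinned.

```python
# Forward one model tool call onto an E2B-style sandbox object. Method
# names follow the surface listed above; tool_input follows the Anthropic
# computer tool shape. Both are assumptions, not a pinned API.
def dispatch(desktop, tool_input):
    action = tool_input["action"]
    if action == "screenshot":
        return desktop.screenshot()
    if action == "mouse_move":
        x, y = tool_input["coordinate"]
        return desktop.mouse_move(x, y)
    if action == "left_click":
        return desktop.left_click()
    if action == "type":
        return desktop.write(tool_input["text"])
    if action == "key":
        return desktop.press(tool_input["text"])
    raise ValueError(f"unhandled action: {action}")
```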
The model picks a tool call, you forward it as a method call on the sandbox object, you screenshot again, you feed the new screenshot back to the model, you repeat. There is no state on your side. When the session is done, the VM is destroyed. The next prompt starts in a fresh, empty Linux box. That property is the entire reason the category exists, and it is also the reason it cannot help with the next part of the page.
The seam these APIs do not cross
An instance API gives you a stateless Linux machine. By construction, that machine knows nothing about the user. It has no logged-in browser session, no email, no calendar, no chat threads, no open Xcode project, no Slack DM with the customer who reported the bug. Whatever the agent needs to know about the user, it has to be passed in as text or recreated inside the sandbox.
That is the right shape for some work. Long scrapes. Multi-hour data jobs. Anything where you want a clean room. It is the wrong shape for the kind of work where the user is sitting at the machine, halfway through a task, and the agent needs to finish it. For that case, the agent has to live on the user's box. Which is a different category, with a different failure mode.
“The instance is fresh, every call. Anything the agent needs to know about the user has to come in as a token in the prompt or be reconstructed inside the sandbox. That is the entire deal.”
[Diagram: the user state inherited by a Linux desktop instance API — an empty desktop.]
Treating a native desktop agent and a Linux desktop instance API as substitutes is the most common framing mistake on this topic. They sit on different sides of the access seam. A team picking between them is not picking between two competitors; it is picking which side of the seam its workload lives on.
A failure mode the instance category does not have
On a Linux desktop instance, the provider chose every app on the image. AT-SPI behaves consistently. If the accessibility tree fails, something is genuinely wrong, not just a quirky app. On a native macOS agent running on the user's machine, the agent has no such guarantee. Most native AppKit apps publish a clean accessibility tree. Most Catalyst apps do. Most Electron apps do, through a shim. But Qt apps, OpenGL apps, and Python-bound apps frequently return AXError.cannotComplete for reasons that have nothing to do with system permissions. The named example in the source is PyMOL.
The native agent has to disambiguate. Fazm's check lives at lines 480 to 534 of Desktop/Sources/AppState.swift. The first probe is a normal AX call against the frontmost app. On .cannotComplete, a second probe runs against Finder, a known good AX target. If Finder also fails, the system permission is genuinely broken and the user gets a notification. If Finder succeeds, the original failure was app-specific, the agent skips the AX path for that one app, and everything else continues. It is one of those places where the right behavior is more interesting than the happy path.
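The actual implementation is Swift, in AppState.swift, but the decision procedure is small enough to sketch language-neutrally. Here it is in Python, with `probe` standing in for a single AX read that returns False on .cannotComplete; the names are illustrative, not Fazm's.

```python
# Two-probe disambiguation, per the description above. probe(app) returns
# True on a clean accessibility read, False on cannotComplete.
def diagnose(probe, frontmost="frontmost", known_good="Finder"):
    if probe(frontmost):
        return "ok"
    if probe(known_good):
        # Finder reads fine, so the failure is app-specific:
        # skip the AX path for this one app, keep everything else.
        return "app_specific"
    # Even the known good target fails: the permission itself is broken.
    return "permission_broken"
```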
Read this not as a Fazm-specific implementation note. Read it as the kind of failure mode that can only exist when the agent runs on hardware the agent vendor does not control. A Linux desktop instance API has to handle a lot of things, but it does not have to handle this one. That is part of why the instance category is so much easier to ship as a clean SDK.
Picking one
The honest decision tree, for the actual workloads I see teams pick this category for:
- You want to read the wire format and self-host eventually. Pick E2B Desktop. It is the cleanest open source option and the base layer under Open Computer Use.
- You want it shipped today and you do not want to run any infra. Pick Scrapybara. Free tier is enough to prototype, the price ladder above it is honest, and the multi-OS support is genuinely useful if your prompt happens to need a Windows box.
- You want the loop the model was trained against. Pick the Anthropic reference container. Lowest model-side surprise.
- You cannot send screenshots of internal state to a vendor API. Pick Open Computer Use, accept the higher per-task turn count from the open source vision model.
- The work is on the user's actual machine. None of the above will work, by construction. On macOS, look at the native-agent category (Fazm is the open source option I maintain). On Windows, the UIAutomation-based agents are the equivalent, and they are far less mature. On Linux, AT-SPI agents running locally are the equivalent, and again, less mature than the instance category.
Building on the wrong side of the seam?
A 20-minute call where we walk through your agent loop and figure out whether it should live on a Linux desktop instance or on the user's actual machine. Bring the prompt. We will tell you if you picked wrong.
Frequently asked questions
What is a Linux desktop instance API for LLM tool use, in one sentence?
A service or library that gives your agent loop a fresh Linux machine with a graphical desktop, exposed through a programmatic interface (start/stop, screenshot, click, type, run a shell command, list files), so an LLM with computer-use tool calls can act on a real OS without touching your laptop.
Which APIs actually exist as of May 2026?
Four that I would call canonical. E2B Desktop Sandbox, open source at github.com/e2b-dev/desktop, Linux with Xfce, Python and TypeScript SDKs. Scrapybara, hosted out of Y Combinator, with Ubuntu, Browser, and Windows instance types and a published price list of free, $29, $99, and Enterprise. Anthropic's reference computer-use-demo Docker container, the one their team ships at github.com/anthropics/anthropic-quickstarts under computer-use-demo, with VNC on port 5900 and a Streamlit UI on 8501. And E2B's Open Computer Use, a separate repo specifically for driving the E2B Desktop with open source LLMs. There are smaller variants and forks; everyone else worth naming is a wrapper over one of these.
Why do these all run Linux specifically?
Because Linux is the only desktop OS you can legally and cheaply spin up by the thousand inside someone else's cloud. Windows desktop instances exist (Scrapybara has them) but the licensing math is meaner, and macOS instances effectively cannot exist this way at all because Apple's licensing only allows macOS guests on Apple hardware. So the entire instance category is shaped by the Linux license: it is the cheapest disposable desktop you can build a per-call billing model on top of.
What screen resolution should I run the instance at?
Anthropic's own README is explicit about this: do not send the model screenshots above XGA. The default in their reference container is 1024x768. You can override with WIDTH and HEIGHT environment variables, but if you exceed XGA the recommended pattern is to scale down before sending to the model, let it interact with the scaled image, and map coordinates back proportionally. The reason is image resizing on the model side: high-resolution captures get downsampled anyway, and the click coordinates the model returns are inferred from the size it actually saw.
How is this different from a native desktop agent on the user's machine?
An instance API gives the agent a fresh, stateless Linux box that has nothing in it, by design. A native desktop agent runs on the user's actual logged-in machine and inherits everything the user is already authenticated to, including their browser session, their email, their calendar, their open Slack threads. The two solve different problems. Instance APIs are right when the model needs a sandbox to do work that nobody is watching (a long scrape, a destructive experiment, a multi-hour data job). A native agent is right when the model needs to act on real state the user is in the middle of. Calling them substitutes is the most common mistake in this space.
What does each instance API actually expose?
All four expose the same five-ish primitives: take a screenshot, move the mouse, click, type a string, and run a shell command. Scrapybara also exposes a structured 'computer' tool that wraps these into a single Anthropic-compatible function-call schema; E2B Desktop's SDK gives you a sandbox object with .screenshot(), .mouse_move(), .left_click(), .write(), .run_command(); the Anthropic reference container exposes them as the computer_20241022 tool spec the model is trained against. The differences between the APIs are mostly auth, billing, and which tool spec your loop is already wired to.
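For concreteness, the Anthropic-side registration looks roughly like the dict below, passed in the `tools` list of a Messages API call under the computer-use beta header. Field names follow the published computer_20241022 spec at the time of writing; double-check them against current docs before relying on them.

```python
# The computer_20241022 tool definition, per Anthropic's published spec
# as of this writing. Verify field names against current docs.
def computer_tool(width=1024, height=768, display=1):
    return {
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": width,    # keep at or under XGA
        "display_height_px": height,
        "display_number": display,
    }
```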
What does Fazm have to do with any of this?
Fazm is in the other category, the native one, on macOS only. It does not give you a Linux desktop instance. It is a Mac app that uses the macOS Accessibility API on the user's actual machine. The point of mentioning it on a page about Linux desktop instance APIs is to draw the line. If your prompt is 'do something on a fresh Linux box', pick one of the four above. If your prompt is 'do something in the user's real Chrome and their real Mac apps', no Linux instance API can reach there. The honest contrast also surfaces a failure mode native agents have to handle that instance APIs do not, which is at AppState.swift:517 and described below.
What is the failure mode you keep referencing at AppState.swift:517?
On a Linux desktop instance you control the entire OS image, so the AT-SPI accessibility tree is consistent. On a user's Mac you do not. Some apps publish a clean accessibility tree (most native AppKit apps, most Catalyst apps, most Electron apps via the AX shim). Other apps return AXError.cannotComplete because they are Qt, OpenGL, or Python bindings that never wired up the AX shims, with PyMOL being the named example in the source. Fazm's AppState.swift function testAccessibilityPermission cannot tell from a single .cannotComplete whether the system permission is genuinely broken or whether the frontmost app just does not implement AX. So it falls back to running the same probe against Finder, which is a known good AX target, and only declares the permission broken if Finder also fails. Lines 480 to 534 of AppState.swift in github.com/m13v/fazm. There is no equivalent to this in any Linux desktop instance API because the provider chose every app on the image.
Are any of the four open source?
E2B Desktop and Open Computer Use are. The Anthropic reference container is open source code-wise, but the agent loop you point at it calls a paid API: Anthropic, AWS Bedrock, or Vertex. Scrapybara has self-hosting on the Enterprise tier per their pricing page; otherwise it is a hosted SaaS. Fazm is also open source and MIT-licensed at github.com/m13v/fazm, but again, different category, native macOS, not a Linux instance.
What is the cheapest way to get started?
Spin up Anthropic's reference Docker container locally. One docker run, ANTHROPIC_API_KEY in an env var, and you have a Linux desktop with a Streamlit UI, a VNC server, and an agent loop that drives Claude on it. The model API tokens are the only cost. After that, if you want hosted infrastructure, Scrapybara has a free tier of 10 compute hours per month and E2B Desktop's hosted version comes with the broader E2B free tier. The cost ladder is roughly: local Docker free, hosted free tier for prototyping, $29 to $99 for production-ish use, Enterprise from there.
Adjacent reading
Accessibility API vs screenshot agents, the real cost difference
What changes when an agent reads a structured tree instead of a 1024x768 PNG, and why the bills come out different.
macOS AI code agent, when the agent reaches past the editor
The same native-vs-instance seam, framed as a coding agent that operates the user's real browser and Mac apps.
Securing AI desktop agents, sandboxing, permissions, and supply chain defense
The other reason teams pick instance APIs: the blast radius is bounded by the VM, not by the user's open Slack thread.