A read from the other side of the computer-use line

The April 2026 OpenAI API changelog, one phrase at a time, from someone shipping a desktop agent that does not use screenshots

OpenAI shipped three things to the developer API in April 2026. An Agents SDK update on April 15. GPT Image 2 on April 21. GPT-5.5 on April 24, with a 1M token context, image input, hosted shell, apply patch, Skills, MCP, web search, and the line that matters most for anyone building agents that touch real apps: built-in computer use.

Most of what you will read about this changelog is a re-list of those bullet points. This is not that. I have spent April 2026 shipping a Mac desktop agent (Fazm 2.4.0 through 2.6.4, all in this month, all visible in CHANGELOG.json on GitHub) that drives the browser, Mail, Notes, Calendar, Messages, and Google Workspace through a path OpenAI's new release does not take. So I want to walk this changelog in order and stop at the GPT-5.5 entry to explain what "built-in computer use" actually grounds against, and what the other path does that the new path does not.

Matthew Diakonov · 11 min read

The dates and what each one was

Three entries on developers.openai.com/changelog carry an April 2026 date. They are not all equally interesting, and not to the same audience. If you build agents that touch a real computer, the first and the third will eat most of your attention.

1. April 15: Agents SDK update

Three additions: running agents in controlled sandboxes, inspecting and customizing the open-source harness, and controlling when memories are created and where they are stored.

This is the foundation for everything else this month. The sandbox is the server-side virtual desktop where a hosted shell or computer-use harness runs without touching your actual laptop. The open-source harness is the code that takes a model output, performs the click or the keystroke, and feeds the next observation back. "Customizable" is load-bearing here: it means you can swap in a different grounding signal if you want to.
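To make that swap point concrete, here is a minimal Swift sketch of the seam a customizable harness could expose. The GroundingSignal protocol and both implementations are hypothetical illustrations of the idea, not types from the Agents SDK.

```swift
// Hypothetical sketch of the seam a customizable harness exposes.
// Neither type exists in the Agents SDK; they only illustrate the swap point.
protocol GroundingSignal {
    /// What the model is shown before it decides the next action.
    func observe() -> String
    /// How an action emitted by the model is addressed.
    var actionAddressing: String { get }
}

struct PixelGrounding: GroundingSignal {
    // Screenshot path: the observation is an image; actions come back
    // as coordinates on that image.
    func observe() -> String { "<PNG of the desktop>" }
    var actionAddressing: String { "click(x:y:) in screen coordinates" }
}

struct TreeGrounding: GroundingSignal {
    // Accessibility path: the observation is a textual tree of elements;
    // actions come back as references into that tree.
    func observe() -> String { "AXApplication > AXWindow > AXButton \"Send\"" }
    var actionAddressing: String { "click(elementID:) by element reference" }
}
```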

2. April 21: GPT Image 2

Image generation and editing. Flexible image sizes, high-fidelity image inputs, token-based image pricing, Batch API support at a 50 percent discount.

Orthogonal to the agent-on-a-Mac thread, but useful if your agent makes images. Token-based pricing makes the cost easy to reason about. Batch at half price makes the "agent generates a hundred variants overnight" pattern cheap.

3. April 24: GPT-5.5 and GPT-5.5 pro

1M token context, image input, structured outputs, function calling, prompt caching, Batch, tool search, built-in computer use, hosted shell, apply patch, Skills, MCP, web search. Reasoning effort defaults to medium. Extended prompt caching is the only caching method supported.

This is the entry the rest of this guide spends time on, specifically the phrase "built-in computer use". The model is trained to read a screenshot supplied in the prompt and return a grounded action. The action vocabulary lives in the model. The harness from April 15 is what turns those actions into real OS events. That part is open source. The grounding signal is pixels.

What "built-in computer use" actually grounds against

The phrase reads like one capability. Architecturally it is two things in a trench coat: a model that takes screenshots in the prompt, and a harness that performs the actions the model emits. Both are needed. Neither alone constitutes "the agent".

When you call the Responses API with computer use enabled, this is what runs each step:

GPT-5.5 built-in computer use, one step

Participants: your harness, the OS / display, and the GPT-5.5 API.

  • harness → OS / display: screencapture(), returning a PNG of the desktop
  • harness → GPT-5.5 API: POST /responses { tools: [computer_use], image: <png> }
  • GPT-5.5 API → harness: action: click(x=812, y=540)
  • harness → OS / display: synthesize mouse event at (812, 540); the click lands

The screenshot is the grounding signal. The model decides where to click by looking at pixels. This is exactly the right shape for an unfamiliar app, an arbitrary game, an OpenGL canvas, a remote desktop, anything that does not surface structured semantics. It is also the right shape for "run this on a fresh sandboxed VM" because the VM has no apps installed yet and the agent has to discover them visually.
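Here is a minimal Swift sketch of that single step as I read it. The two helper functions are stubbed stand-ins, not real SDK calls; the actual Responses API payload and the action schema the model returns are not reproduced here, only the shape of the loop.

```swift
import Foundation
import CoreGraphics

// Placeholder for the grounded action shown in the flow above.
struct ClickAction { let x: Double; let y: Double }

// Hypothetical helpers, stubbed so the sketch stands alone. A real harness
// would capture via screencapture/ScreenCaptureKit and call the Responses
// API with computer use enabled.
func captureDesktopPNG() -> Data { Data() }
func requestNextAction(screenshot: Data) -> ClickAction { ClickAction(x: 812, y: 540) }

func runOnePixelGroundedStep() {
    // 1. Grounding signal: a PNG of the whole desktop.
    let screenshot = captureDesktopPNG()

    // 2. The model sees pixels and returns coordinates.
    let action = requestNextAction(screenshot: screenshot)

    // 3. The harness turns those coordinates into a real OS event.
    let point = CGPoint(x: action.x, y: action.y)
    let down = CGEvent(mouseEventSource: nil, mouseType: .leftMouseDown,
                       mouseCursorPosition: point, mouseButton: .left)
    let up = CGEvent(mouseEventSource: nil, mouseType: .leftMouseUp,
                     mouseCursorPosition: point, mouseButton: .left)
    down?.post(tap: .cghidEventTap)
    up?.post(tap: .cghidEventTap)

    // 4. The next iteration screenshots again and pays the pixel token cost again.
}
```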

It is not the only grounding signal a desktop agent can use. The OS already maintains a structured tree of every UI element on screen. On macOS that tree is the accessibility hierarchy, the same one VoiceOver and Switch Control rely on. An agent that reads from that tree never needs a screenshot to find a button.

The other path: an accessibility tree, in two real files

I want to make this concrete enough that you can verify it yourself on a clean checkout. The Fazm desktop app is open source on GitHub. The two files that define the not-screenshots path are in there.

File 1

acp-bridge/src/index.ts, lines 1058 to 1066

The agent runtime registers MCP servers it can call. The third one in the list is macos-use, a native binary referenced earlier in the file at line 63 as Contents/MacOS/mcp-server-macos-use. The comment on the registration block reads "native macOS accessibility automation". That binary is what answers when the agent decides to click a button in Mail or read a row in a spreadsheet. It does not screenshot anything to do that.

File 2

Desktop/Sources/AppState.swift, line 439

The Swift host process runs a startup probe to verify that the accessibility permission is actually wired up. It does that by calling AXUIElementCreateApplication(frontApp.processIdentifier) and then AXUIElementCopyAttributeValue(appElement, kAXFocusedWindowAttribute as CFString, &focusedWindow). That is the same call shape the macos-use MCP server uses to read every visible UI element. There is no CGImage in this path, no JPEG encoding, no token cost paid for pixels in the prompt.
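For comparison, here is a minimal Swift sketch of that probe shape: process id in, focused-window reference out, no image anywhere in the path. This is my paraphrase of the call sequence described above, not the actual AppState.swift code, and it assumes the accessibility permission is already granted.

```swift
import AppKit
import ApplicationServices

// Sketch of the startup probe: the same two AX calls named above, nothing else.
func focusedWindowOfFrontmostApp() -> AXUIElement? {
    guard let frontApp = NSWorkspace.shared.frontmostApplication else { return nil }

    // Root of the accessibility tree for that process.
    let appElement = AXUIElementCreateApplication(frontApp.processIdentifier)

    // Ask the OS which window has focus. The answer is an element
    // reference, not a picture of the window.
    var focusedWindow: CFTypeRef?
    let result = AXUIElementCopyAttributeValue(appElement,
                                               kAXFocusedWindowAttribute as CFString,
                                               &focusedWindow)
    guard result == .success, let window = focusedWindow else { return nil }
    return (window as! AXUIElement)
}
```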

Same problem (drive a real Mac), totally different grounding signal. One puts a PNG into a 1M-token context window and asks for coordinates. The other asks the OS for a tree of strings and IDs and clicks the element by reference. The April 24 release does not invalidate the second approach; it just means the model provider now ships the first one as a primitive. That is useful. It is also a clean place to draw the line between the two architectures, a line most other writeups of this changelog do not draw.

Where each grounding signal wins

The honest answer is that it depends on the app, the task, and where the agent runs.

Pixel grounding wins when

  • The target app exposes no accessibility metadata: OpenGL canvases, a lot of games, custom-rendered UI
  • You are running on a fresh sandbox VM where you do not even know what apps are installed
  • The task is visual by nature: judging a layout, reading a chart, editing an image
  • You are remoting into someone else's desktop and only have video of it

Accessibility-tree grounding wins when

  • The agent runs locally and the user wants Mail, Calendar, Notes, Messages, Slack, browser, and Google Workspace driven on their actual logged-in Mac
  • You need deterministic targeting: a button by role + title, not coordinates that change with DPI or font scale
  • You care about latency and cost per step: a tree query is small text, not a tokenized PNG
  • You want the agent to keep working when the screen is occluded, the laptop is closed in clamshell mode, or a different Space is foregrounded

A serious agent product picks both. Fazm bundles playwright and macos-use and a capture_screenshot tool, registered side by side in acp-bridge/src/index.ts. The accessibility tree is the cheap default. The screenshot is the escape hatch when the user explicitly asks "what is on my screen". Pixel grounding is welcome whenever the task needs it. The point is that the model provider shipping computer use as a primitive is one signal, not the whole stack.

What I am not going to do is panic about this release

The shape of the GPT-5.5 announcement makes computer use sound like the headline feature. From a builder's chair it is one of eleven new capabilities in the same line item. The other ten are arguably more impactful for normal API workloads: 1M context, structured outputs, prompt caching, Batch, tool search, hosted shell, apply patch, Skills, MCP, web search. If your agent today uses Claude or any other Anthropic-compatible gateway through a custom API endpoint, the GPT-5.5 release is not a forced migration. It is a point on a curve.

The phrase I keep coming back to is the April 15 line about "inspecting and customizing the open-source harness". That is the part that, over time, lets a model provider's notion of computer use become an industry shape rather than a single product surface. The harness becomes the place where pixel grounding and tree grounding can both live. The model can return either a coordinate or a target by id. Today GPT-5.5 is mostly the first one. Tomorrow it does not have to be.

A small note on what I shipped this April, on the other path

Because it is fair to put work next to opinion: while OpenAI was shipping Agents SDK sandboxes and GPT Image 2 and GPT-5.5, the Fazm desktop app shipped seven public versions in April 2026, from 2.4.0 on April 20 through 2.6.4 on April 29. The full list is in CHANGELOG.json in the repo. None of those releases involved a screenshot loop. They involved fixes to pop-out chat windows, session recovery after rate-limit errors, voice transcription biasing for product names, and a model-id migration after an SDK rename. The accessibility-tree path is not glamorous. It is just the path the agent took every time it opened Mail, drafted a reply, scheduled a meeting, or filled in a row of a spreadsheet.

Want to see the not-screenshots path running on your own Mac?

I run a 25-minute call to walk through how a tree-grounded agent handles real workflows: invoicing, CRM, scheduling, the boring stuff that eats your week.

Frequently asked questions

What did OpenAI ship to the API in April 2026?

Three confirmed entries on the developer changelog. April 15: an Agents SDK update that added running agents in controlled sandboxes, inspecting and customizing the open-source harness, and controlling when memories are created and where they are stored. April 21: GPT Image 2, a new image generation and editing model with flexible image sizes, high-fidelity image inputs, token-based image pricing, and Batch API support at a 50 percent discount. April 24: GPT-5.5 to the Chat Completions and Responses API, plus GPT-5.5 pro for more computationally intensive tasks. GPT-5.5 advertises a 1M token context window, image input, structured outputs, function calling, prompt caching, Batch, tool search, built-in computer use, hosted shell, apply patch, Skills, MCP, and web search. Reasoning effort defaults to medium. Extended prompt caching is the only caching method supported.

What does 'built-in computer use' actually mean in GPT-5.5?

It means the model is trained to look at a screenshot of a computer screen, in pixels, and emit grounded actions like click(x, y), type(text), or scroll(delta). The harness sends a screenshot back to the API in the prompt, the model returns coordinates, the harness performs the click and screenshots again. The 'built-in' part is that the action vocabulary and the screenshot-conditioning are baked into the model. The harness is open source and inspectable per the April 15 Agents SDK note. The grounding signal is still raw pixels: pixels in, pixel-coordinate actions out.

How is the accessibility-API path different from screenshot-grounded computer use?

An accessibility-API agent does not look at pixels. It queries the operating system for the structured tree the OS already maintains for screen readers, automation, and assistive tech. On macOS the call is AXUIElementCreateApplication for a process, then AXUIElementCopyAttributeValue for kAXFocusedWindowAttribute, kAXChildrenAttribute, kAXTitleAttribute, kAXValueAttribute, and kAXRoleAttribute. The returned values are strings, not pixels. The agent reasons about a textual tree of buttons, text fields, menu items, and rows. Actions target elements by reference, not by coordinate. There is no JPEG in the prompt for that branch. The two architectures both call themselves 'computer use' and both can drive a Mac, but they ground against entirely different signals.
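A minimal Swift sketch of what "reasons about a textual tree" looks like, using exactly the attributes named above. The recursive printer is my own illustration rather than code from the Fazm repo; a real agent would serialize this tree for the model instead of printing it.

```swift
import ApplicationServices

// Copy one attribute from an element; nil if the element does not expose it.
func axValue(_ element: AXUIElement, _ attribute: String) -> CFTypeRef? {
    var value: CFTypeRef?
    guard AXUIElementCopyAttributeValue(element, attribute as CFString, &value) == .success else {
        return nil
    }
    return value
}

// Walk the tree the OS already maintains: roles and titles as strings,
// children as element references. No screen capture anywhere.
func printTree(_ element: AXUIElement, depth: Int = 0) {
    let role  = axValue(element, kAXRoleAttribute)  as? String ?? "?"
    let title = axValue(element, kAXTitleAttribute) as? String ?? ""
    print(String(repeating: "  ", count: depth) + "\(role) \(title)")

    if let children = axValue(element, kAXChildrenAttribute) as? [AXUIElement] {
        for child in children {
            printTree(child, depth: depth + 1)
        }
    }
}
```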

Where can I see the accessibility-API approach in real shipping code?

Look at acp-bridge/src/index.ts in the open source Fazm repo. Line 63 names a binary path: Contents/MacOS/mcp-server-macos-use. Lines 1058 to 1066 register that binary as an MCP server called macos-use, alongside playwright for browser automation and a Catalyst-app whatsapp-mcp. Then look at Desktop/Sources/AppState.swift at line 439, which calls AXUIElementCreateApplication(frontApp.processIdentifier) and follows up with AXUIElementCopyAttributeValue for kAXFocusedWindowAttribute. That is the entire grounding path: process id to AXUIElement, attribute by attribute, no screen capture involved unless the user explicitly asks 'what is on my screen' and a separate capture_screenshot tool is invoked.

Does the GPT-5.5 'built-in computer use' release replace tools like Fazm?

Not really, because they ground against different signals. A pixel-grounded agent like the GPT-5.5 reference harness can in principle drive any application, including one that ships zero accessibility metadata. An accessibility-tree agent cannot drive an OpenGL canvas, a Qt app with no AX hooks, or a custom-rendered game. On the other hand the accessibility-tree path is faster, deterministic, robust to font rendering and DPI changes, and does not need a screenshot in the prompt for every step. For long Mac workflows that are mostly Mail, Calendar, Notes, Messages, Slack, browser, and Google Workspace, the AX tree path is meaningfully cheaper per step. For drawing in Photoshop or playing a 3D game, pixel grounding wins. They are complementary tools, not competitors. The interesting line on the changelog is not 'computer use shipped, everyone else dies', it is 'computer use is now a primitive at the model level, so you can stop bolting on a screenshot tool and start picking which grounding signal your task actually needs'.

Is the April 15 Agents SDK 'controlled sandbox' line related to computer use?

Yes, indirectly. The sandbox is where a server-side agent gets to do its work. It is the place a hosted shell, a hosted code interpreter, and the computer-use harness can run without touching the user's actual machine. That is the architecturally interesting part for desktop agent builders: server-side computer use runs against a server-side virtual desktop, which is great for repeatable end-to-end tasks (signing into a SaaS, scraping a page) but is by definition not your laptop. If the task is 'open the Notes file I started this morning and turn it into an email draft in Mail.app', a sandbox agent has nothing to look at. A local accessibility-tree agent has the actual Notes window in its tree.

Did GPT Image 2 affect anything for desktop AI agents?

Indirectly. GPT Image 2 is a generation and editing model, not a perception model. It does not help an agent click a button. It would matter if you were building a pipeline where the agent generates illustrations or edits screenshots inside a workflow. The 50 percent Batch API discount and token-based pricing make 'agent edits an image, posts it' workflows cheaper to run at scale, which fits the small-business automation use case. It does not touch screen grounding at all, so it is mostly orthogonal to the computer-use thread on this same changelog.

What about the Assistants API sunset, the Sora API, GPT-5.4 nano, and the other things news roundups list under 'OpenAI April 2026'?

Several of those announcements appear in roundups but are not the dated entries on the developers.openai.com/changelog feed for April 2026. The developer-facing API changelog for April 2026 confirms three dated entries: Agents SDK update on April 15, GPT Image 2 on April 21, and GPT-5.5 plus GPT-5.5 pro on April 24. Other April announcements live on adjacent surfaces (release notes, ChatGPT release notes, marketing posts) and are best read on those specific feeds rather than mixed into the API changelog. This page deliberately stays inside the developer API changelog scope.