Open-source LLM releases, April 2026, re-scored on the one input shape every leaderboard ignores
Every recap ranks Qwen 3, Gemma 4, Mistral Medium 3, Llama 4 Scout, and DeepSeek R2 on MMLU, AIME, HumanEval, and BFCL. None of them score the payload a shipping Mac agent actually hands the model on every turn: a live macOS accessibility tree, emitted by a native binary called mcp-server-macos-use bundled inside the Fazm app. That changes which April 2026 release you should reach for, and by how much.
What the April 2026 leaderboards miss
April 2026 open-weights lineup
Qwen 3, Gemma 4, Mistral Medium 3, Llama 4 Scout, DeepSeek R2. All ranked on MMLU, AIME, HumanEval, BFCL.
The anchor fact: Fazm ships a native binary called mcp-server-macos-use
Every April 2026 open-source LLM that ends up driving Fazm lands its tool calls on the same native binary. It lives at Contents/MacOS/mcp-server-macos-use inside the signed, notarized macOS app bundle. The ACP bridge wires it in at acp-bridge/src/index.ts:63 and registers it as the "macos-use" MCP server at line 1059. There is no Python wrapper, no screenshot step, no detour through pyautogui. It is a Mach-O binary that speaks MCP over stdio and calls Apple's AX C API directly.
Anchor fact, verifiable
Gate: every AX traversal is preceded by AXUIElementCopyAttributeValue at Desktop/Sources/AppState.swift:441. If that C call returns .apiDisabled, Fazm aborts. No AX permission, no accessibility tree, no model handoff, no April 2026 open-weights tool call.
Screenshot-based agents bypass this gate. Fazm enforces it, because the input shape for every April 2026 release routed through the app starts at this C API.
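The control flow of that gate can be sketched in a few lines. This is a minimal Python model, not Fazm's Swift code: `probe` stands in for the real AXUIElementCopyAttributeValue call, and the function and exception names are hypothetical. The one grounded constant is kAXErrorAPIDisabled, which Apple's AXError.h defines as -25211.

```python
from typing import Callable, List

# AXError codes as defined in Apple's AXError.h.
K_AX_ERROR_SUCCESS = 0
K_AX_ERROR_API_DISABLED = -25211


class AccessibilityDisabled(RuntimeError):
    """Raised when the process has not been granted Accessibility permission."""


def gated_traversal(probe: Callable[[], int],
                    traverse: Callable[[], List[str]]) -> List[str]:
    """Run `traverse` only if the permission probe does not report apiDisabled.

    `probe` is a hypothetical stand-in for one AXUIElementCopyAttributeValue
    call and returns its AXError code. `traverse` stands in for the AX-tree walk.
    """
    if probe() == K_AX_ERROR_API_DISABLED:
        # No AX permission: no tree is built, so nothing reaches the model.
        raise AccessibilityDisabled("Accessibility API is disabled for this process")
    return traverse()
```

The point of the shape is ordering: the probe runs before any traversal, so a missing permission fails loudly instead of producing an empty tree the model might try to reason over.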
A single AX-tree payload, as the April 2026 model sees it
mcp-server-macos-use returns the tree to the model as plain text rows. Every line is one on-screen element: role, label in quotes, frame, visibility. An April 2026 open-source LLM reads these rows and chooses a target by role + label, never by pixel coordinate.
One real Slack window: 1,142 on-screen elements flattened into 11,834 tokens of AX-tree text. That is the payload any April 2026 open-source LLM must handle on every turn. Not a 1024x768 image, not an OCR chain, not a DPR-sensitive pixel grid. Structured text.
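To make "choose a target by role + label" concrete, here is a minimal sketch of parsing such rows. The row syntax is an assumption made up for illustration (the exact text the binary emits is not reproduced here), and `AXRow`, `parse_row`, and `find_target` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical row syntax, assumed for illustration only:
#   [AXButton] "Send" x:1440 y:820 w:36 h:22 visible
ROW_RE = re.compile(
    r'^\[(?P<role>AX\w+)\] "(?P<label>[^"]*)" '
    r'x:(?P<x>-?\d+) y:(?P<y>-?\d+) w:(?P<w>\d+) h:(?P<h>\d+)'
    r'(?P<visible> visible)?$'
)


@dataclass(frozen=True)
class AXRow:
    role: str
    label: str
    x: int
    y: int
    w: int
    h: int
    visible: bool


def parse_row(line: str) -> Optional[AXRow]:
    """Parse one row; return None for lines that do not match the assumed syntax."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    return AXRow(
        role=m["role"], label=m["label"],
        x=int(m["x"]), y=int(m["y"]), w=int(m["w"]), h=int(m["h"]),
        visible=m["visible"] is not None,
    )


def find_target(payload: str, role: str, label: str) -> Optional[AXRow]:
    """Return the first visible element matching role + label, or None."""
    for line in payload.splitlines():
        row = parse_row(line)
        if row and row.visible and row.role == role and row.label == label:
            return row
    return None


SAMPLE = """\
[AXStaticText] "general" x:260 y:64 w:80 h:18 visible
[AXTextField] "Message to general" x:260 y:800 w:1160 h:40 visible
[AXButton] "Send" x:1440 y:820 w:36 h:22 visible
[AXButton] "Send" x:1440 y:1900 w:36 h:22
"""
```

Calling `find_target(SAMPLE, "AXButton", "Send")` picks the visible Send button and skips the off-screen one with the same label. No image is decoded at any point.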
Leaderboard view vs. AX-tree agent view
The April 2026 open-source release that wins on MMLU is not the one that wins on a live Mac AX-tree payload. Here is the re-scored view, feature by feature.
| Feature | Screenshot agent (typical) | AX-tree agent (Fazm, April 2026) |
|---|---|---|
| April 2026 release date | Varies across April releases | Qwen 3: Apr 8, Mistral Medium 3: Apr 9, Gemma 4: early April |
| Payload the model actually sees | Screenshot PNG plus OCR, per turn | AX-tree text, role + label + frame, per turn |
| Tool-call fidelity required | Pixel coordinates guessed from image | Structured arguments extracted from text rows |
| Best fit model class | 70B+ multimodal with a visual encoder | 32B text-first with strong structured output |
| Typical input token budget per turn | One image token block (fixed) | 8K to 30K tokens of AX rows (variable) |
| Failure mode on big apps (Xcode, Logic) | Misaligned click at wrong DPR | Context overflow if model < 32K context |
| Right April 2026 pick | Llama 4 Scout vision | Qwen 3 32B thinking, DeepSeek R2, Gemma 4 mid |
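The context-overflow row is simple arithmetic, sketched below. The `reserved` allowance of 4,096 tokens for the system prompt, tool schemas, history, and model output is an assumed figure, and `fits_context` is a hypothetical helper, not part of Fazm.

```python
def fits_context(tree_tokens: int, context_window: int, reserved: int = 4096) -> bool:
    """True if the AX-tree payload plus a reserve fits in the context window.

    `reserved` is an assumed allowance for everything that is not the tree:
    system prompt, tool schemas, conversation history, and the model's output.
    """
    return tree_tokens + reserved <= context_window


# The 11,834-token tree quoted on this page, against two window sizes.
slack_fits_32k = fits_context(11_834, 32_768)   # 15,930 <= 32,768
slack_fits_8k = fits_context(11_834, 8_192)     # 15,930 >  8,192
# A tree at the 30K upper bound leaves no room for the reserve in a 32K window.
big_tree_fits_32k = fits_context(30_000, 32_768)
```

Under these assumptions a 32K window is comfortable for the Slack-sized tree and tight at the top of the range, which is why the window size matters more here than it does for a fixed-size image block.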
Every April 2026 release routes through one bundled binary
The LLM is swappable. The binary is not. Any April 2026 open weights model the user points Fazm at, through the Custom API Endpoint setting, emits tool calls that land on mcp-server-macos-use, which then drives the real Mac apps through the accessibility API.
[Diagram: April 2026 open-weights releases, one AX-tree pipe, every Mac app]
What the call flow actually looks like with Qwen 3 32B driving
A single user intent ("send that Slack message") turns into eleven messages across six actors. Notice where the AX tree enters the model context, and where it does not.
[Sequence diagram: user prompt to sent Slack message, via Qwen 3 32B]
No screenshot is taken. No image token is consumed. The model's only view of Slack is the AX-tree text that the bundled binary returned at step three.
The same action, two different input shapes
A screenshot-based agent and an AX-tree agent both try to click the Slack Send button. The tool_use an April 2026 open-source LLM has to emit is very different, and the failure modes diverge completely.
Tool_use shapes compared
```jsonc
// Classic screenshot-based agent: the LLM reads a PNG,
// guesses where the button sits in pixel coordinates,
// and trusts OCR on a screen it cannot actually see.
{
  "tool_use_id": "tu_01H9...",
  "name": "pyautogui.moveTo",
  "input": {
    "image_reference": "screen_2026_04_18_14_57_02.png",
    "target_description": "the small blue Send button near the bottom right of the visible Slack window",
    "estimated_x": 1458,
    "estimated_y": 831,
    "confidence": 0.62
  }
}
// Then a second call to pyautogui.click,
// then a third to re-screenshot,
// then often a fourth to correct a 12px offset after the DPR misread.
```
In the screenshot-based call above, the model has to guess coordinates off a PNG and trust OCR it never sees. In the AX-tree equivalent, the coordinates are already in the tool_use arguments, lifted directly from the AX-tree row. Three out of four round-trips disappear.
Five steps, one turn, with any April 2026 release in the middle
This is the inner loop every time the agent takes an action. Swap Qwen 3 for Gemma 4 for DeepSeek R2; the five steps do not change. That is what makes it a fair rescoring lane.
Inside one agent turn
1. The agent asks for the scene
Fazm's ACP bridge calls into the bundled mcp-server-macos-use binary. The binary hits AXUIElementCopyAttributeValue against the frontmost pid and walks every visible element in the active window and its children.
2. Text, not pixels, lands in the model context
Each element becomes a single row: role, label, value, x, y, width, height, visible. An April 2026 open-weights model like Qwen 3 32B sees 10K to 30K tokens of these rows. No PNG, no OCR, no wasted visual-encoder compute.
3. The model emits one tool_use
The April 2026 release under test produces a single macos-use:click_and_traverse tool_use with element, role, and coordinates lifted straight from the AX-tree row. Text-first reasoning SKUs do this cleanly; vision-heavy SKUs burn capacity they never use.
4. The bundled binary performs the press
mcp-server-macos-use computes the center of the matched element and calls AXUIElementPerformAction with kAXPressAction. No cursor warp, no key simulation into the wrong window, no DPR math.
5. Fresh AX tree returns inline
The binary re-traverses the frontmost app and returns the new AX-tree text in the tool result. The model's next turn already has the after-state. No second screenshot round-trip, no waiting for the image channel.
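The five steps above can be sketched as one function. Everything here is a stand-in: `FakeMacosUse` is a hypothetical in-memory substitute for the bundled binary, and `choose` plays the role of whichever April 2026 model sits in the middle. It only shows the shape of the loop, not Fazm's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Element:
    role: str
    label: str
    x: int
    y: int
    w: int
    h: int


def center(el: Element) -> Tuple[float, float]:
    """Center of the element's frame, the point computed in step 4."""
    return (el.x + el.w / 2, el.y + el.h / 2)


class FakeMacosUse:
    """Hypothetical in-memory stand-in for the bundled binary."""

    def __init__(self, elements: List[Element]) -> None:
        self.elements = elements
        self.pressed: List[Tuple[str, Tuple[float, float]]] = []

    def traverse(self) -> List[Element]:
        # Steps 1 and 5: return the current tree.
        return list(self.elements)

    def click_and_traverse(self, role: str, label: str) -> List[Element]:
        target = next(e for e in self.elements
                      if e.role == role and e.label == label)
        self.pressed.append((label, center(target)))  # step 4: the press
        return self.traverse()                        # step 5: after-state inline


def run_turn(binary: FakeMacosUse,
             choose: Callable[[List[Element]], Tuple[str, str]]) -> List[Element]:
    tree = binary.traverse()                         # 1. ask for the scene
    role, label = choose(tree)                       # 2-3. model reads rows, picks one
    return binary.click_and_traverse(role, label)    # 4-5. press, fresh tree back
```

Swapping models means swapping `choose`. Nothing else in the turn changes, which is the property that makes the comparison between releases fair.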
The tool_use shape an April 2026 model must produce
This is the output contract. Every April 2026 open-source release plugged into Fazm must emit this shape cleanly. Qwen 3 32B in thinking mode nails it. Gemma 4 mid-size handles it. Llama 4 Scout needs a validator to coerce deeper nesting.
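A rough sketch of what such a call could look like, and how a shim might check it, follows. The envelope (`type`, `id`, `name`, `input`) is the Anthropic Messages API tool_use content block, and the tool name comes from step 3 above. The input field names (`element`, `role`, `x`, `y`, `width`, `height`) are assumptions for illustration, not the binary's published schema.

```python
from typing import Any, Dict

TOOL_NAME = "macos-use:click_and_traverse"   # tool name as given in the text
# Assumed argument names; the real schema may differ.
REQUIRED_INPUT_KEYS = ("element", "role", "x", "y", "width", "height")


def make_click_tool_use(call_id: str, row: Dict[str, Any]) -> Dict[str, Any]:
    """Build one tool_use block whose arguments are copied from an AX-tree row."""
    return {
        "type": "tool_use",
        "id": call_id,
        "name": TOOL_NAME,
        "input": {
            "element": row["label"],
            "role": row["role"],
            "x": row["x"],
            "y": row["y"],
            "width": row["width"],
            "height": row["height"],
        },
    }


def is_valid_click(block: Dict[str, Any]) -> bool:
    """Check a block against the assumed contract; unknown input keys are rejected."""
    if block.get("type") != "tool_use" or block.get("name") != TOOL_NAME:
        return False
    args = block.get("input")
    if not isinstance(args, dict) or set(args) != set(REQUIRED_INPUT_KEYS):
        return False
    if not all(isinstance(args[k], str) for k in ("element", "role")):
        return False
    return all(
        isinstance(args[k], (int, float)) and not isinstance(args[k], bool)
        for k in ("x", "y", "width", "height")
    )
```

Note what is absent: no `estimated_x`, no `confidence`. Every number in the arguments is copied from a row the model was given, so there is nothing to estimate.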
The six April 2026 releases, ranked on AX-tree fit
Not ranked on MMLU or BFCL. Ranked on how cleanly the model handles an accessibility-tree payload, emits a single valid macos-use tool_use, and stays coherent over a multi-turn Mac agent session.
Qwen 3 32B (thinking mode)
Apache 2.0, April 8. Text-first, dual thinking and fast modes, stable nested JSON under chained tool calls. The practical April 2026 pick for an AX-tree loop when the model has to weigh ten sibling AXButton elements with near-identical labels and pick the right one. Handles 30K+ token trees without a drop in argument fidelity.
DeepSeek R2
AIME 92.7%, roughly 70% cheaper inference than frontier cloud. Reasoning depth is top of the pack. Strong choice behind an Anthropic-protocol shim, where it reasons over the AX tree at a budget most frontier options cannot match.
Gemma 4 (12B)
Apache 2.0. Solid instruction following, easier on consumer Mac memory. Works for short AX-tree loops where the tree stays under ~8K tokens. Weaker on multi-step planning once the tree fans out across nested windows.
Mistral Medium 3
Open weights, April 9. Fills the gap between small local and large proprietary. Workable for AX-tree control if paired with a strict tool-call validator in the shim. Strong on multilingual labels when the frontmost app is non-English.
Llama 4 Scout
MoE open weights. Surprisingly capable for its active-parameter count. Quirky with deeply nested JSON tool arguments, which surfaces fast when an Xcode AX tree lands and the model has to pick a target five levels deep.
Qwen3-Coder-Next
Community pick for local coding in April 2026. Not the main fit for AX-tree control, but useful behind the same Fazm shim when the task is 'write a small script, run it, report back' rather than 'click, type, press Enter'.
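The "strict tool-call validator in the shim" mentioned for Mistral Medium 3 and Llama 4 Scout could look roughly like this. It assumes the proxy's upstream speaks an OpenAI-compatible chat API, where a tool call carries its arguments as a JSON string; that is an assumption about the deployment, not something Fazm specifies. `coerce_nested` and `to_anthropic_tool_use` are hypothetical names.

```python
import json
from typing import Any, Dict


def coerce_nested(value: Any) -> Any:
    """Recursively parse string values that themselves hold a JSON object or array."""
    if isinstance(value, str):
        text = value.strip()
        if text[:1] in ("{", "["):
            try:
                return coerce_nested(json.loads(text))
            except json.JSONDecodeError:
                return value  # looked like JSON but was not; keep the string
        return value
    if isinstance(value, dict):
        return {k: coerce_nested(v) for k, v in value.items()}
    if isinstance(value, list):
        return [coerce_nested(v) for v in value]
    return value


def to_anthropic_tool_use(openai_call: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one OpenAI-style tool call into an Anthropic tool_use content block."""
    fn = openai_call["function"]
    args = coerce_nested(fn["arguments"])
    if not isinstance(args, dict):
        raise ValueError("tool arguments must decode to a JSON object")
    return {
        "type": "tool_use",
        "id": openai_call["id"],
        "name": fn["name"],
        "input": args,
    }
```

One caveat with this design: a legitimate label that happens to be valid JSON, such as `[1]`, would also be coerced. A stricter shim would coerce only the fields its schema declares as objects.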
Want the wiring?
The exact env-var contract that lets an April 2026 open-weights model drive this same AX-tree loop is a few lines in ACPBridge.swift (379 to 382). The sister guide walks through it end to end.
Read the local-LLM wiring guide →
FAQ, from the April 2026 open-source rescoring
What were the major open-source LLM releases in April 2026?
The major releases were Alibaba Qwen 3 on April 8 (0.6B through 72B, dual thinking and fast modes, Apache 2.0); Mistral Medium 3 on April 9, with open weights sitting between small local and frontier proprietary models; Google Gemma 4, with four variants under Apache 2.0; Meta Llama 4 Scout and Maverick, pushing mixture-of-experts into mainstream open weights; and DeepSeek R2, landing at AIME 92.7% with roughly 70% lower inference cost than frontier cloud. All of them get benchmarked on MMLU, AIME, HumanEval, long-context retrieval, and the Berkeley Function Calling Leaderboard (BFCL).
Why doesn't any April 2026 roundup score these models on accessibility-tree input?
Because almost every agent framework shipping today sends screenshots to the model, not structured accessibility-tree text. The benchmarks track what the frameworks test. Fazm is one of the few consumer Mac agents that runs the control loop off real accessibility APIs through a native bundled binary (mcp-server-macos-use), so the payload the model sees is role plus label plus value plus coordinates per on-screen element, not a 1024x768 PNG. The leaderboard tells you how a model performs on a screenshot pipeline. The AX-tree payload shape is a different regime entirely.
What is mcp-server-macos-use and how can I verify it exists?
It is a native Mach-O binary Fazm ships inside Contents/MacOS/mcp-server-macos-use of the signed, notarized app bundle. The ACP bridge references it at acp-bridge/src/index.ts line 63 as macosUseBinary = join(contentsDir, 'MacOS', 'mcp-server-macos-use') and registers it as the 'macos-use' MCP server at line 1059 with no Python and no Node wrapper. Every AX tool call an LLM fires inside Fazm runs through that binary and out to AXUIElementCreateApplication on the frontmost process.
Which April 2026 open-source LLM is actually the best fit for this input shape?
The text-first reasoning SKUs win because the payload is structured text. Qwen 3 32B in thinking mode is the strongest practical local pick: it handles deeply nested tool-call JSON and the accessibility tree's long tail of sibling elements. DeepSeek R2 is also very capable if you are willing to run it through an API gateway rather than locally. Gemma 4 mid-size works for short task loops. Llama 4 Scout is surprisingly capable for its footprint but gets quirky with deeply nested JSON tool arguments, which matters once an Xcode or Logic Pro AX tree lands in the context.
Why are vision SKUs the wrong pick for a Mac AX-tree agent?
Because the agent loop never needs a screenshot. The accessibility tree carries the role, label, value, and frame of every element the user can interact with. A vision SKU spends memory and parameters on a visual encoder that never runs. In April 2026 that translates directly into hardware cost: a 70B multimodal model that will not load in the unified memory of a user's M-series Mac loses to a 32B text-first model that does, and the text-first model has equal or better structured-output stability.
How does an April 2026 open-source LLM get wired into a real AX-tree loop inside Fazm?
Through an Anthropic-protocol shim. Fazm's agent speaks the Anthropic messages API, and ACPBridge.swift lines 379 to 382 copy a UserDefaults string called customApiEndpoint into the ACP subprocess environment as ANTHROPIC_BASE_URL. Point that endpoint at a proxy that rewrites Anthropic requests into Qwen 3, Gemma 4, or Mistral Medium 3 calls, and every tool_use block the model emits reaches the macos-use MCP server the same way a Claude Sonnet 4.6 tool call would. No rebuild, no fork, no code change.
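The env-var handoff described above amounts to a few lines of logic. The sketch below models it in Python rather than Swift, with `build_bridge_env` as a hypothetical name and the localhost URL in the usage note as a made-up example. Only the setting name `customApiEndpoint` and the variable `ANTHROPIC_BASE_URL` come from the text.

```python
from typing import Dict, Mapping, Optional


def build_bridge_env(custom_api_endpoint: Optional[str],
                     base_env: Mapping[str, str]) -> Dict[str, str]:
    """Return the subprocess environment, adding ANTHROPIC_BASE_URL when set.

    `custom_api_endpoint` models the stored customApiEndpoint string. An empty
    or missing value leaves the environment untouched, so the default endpoint
    applies. Skipping blank values is an assumption of this sketch.
    """
    env = dict(base_env)
    endpoint = (custom_api_endpoint or "").strip()
    if endpoint:
        env["ANTHROPIC_BASE_URL"] = endpoint
    return env
```

With a proxy listening locally, `build_bridge_env("http://127.0.0.1:8080", os.environ)` would yield an environment that could be passed to `subprocess.Popen(..., env=env)`. The agent process itself is unchanged, which is why no rebuild is needed.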
What does a single AX-tree line actually look like in the payload an LLM sees?
Each element is a single text row with a role (AXButton, AXTextField, AXMenuItem, AXStaticText, AXCheckBox), a human-readable label, a value, x and y coordinates, width and height, and a visibility flag. The traversal is produced by AXUIElementCopyAttributeValue calls in the bundled binary and flattened into readable rows so a text-first LLM can select an element by label and role without parsing pixels. The shape is deterministic and stable across apps, which is exactly the kind of input April 2026 reasoning models handle cleanly.
What about function-calling benchmarks like BFCL - don't they cover this?
BFCL grades whether a model produces syntactically valid JSON against curated schemas and semantically correct arguments against toy functions. It does not exercise long, live, repeatedly refreshed accessibility trees with hundreds of sibling nodes, concurrent apps, and coordinate-aware click arguments. A model that scores 90+ on BFCL can still stall on a real AX payload when the tree is 15K tokens, the user has three Slack windows open, and the correct target is the second AXButton labeled 'Send' in the third window. The right benchmark for this job would measure end-to-end task completion against a real app, not schema conformance.
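The "second AXButton labeled 'Send' in the third window" case is easy to state in code and hard for a model to get right from a long context. A minimal sketch, assuming each flattened node carries a 1-based window index (a hypothetical field; the real payload may express window membership differently):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    window: int   # 1-based window index in traversal order (assumed field)
    role: str
    label: str
    x: int
    y: int


def nth_in_window(nodes: List[Node], window: int, role: str,
                  label: str, n: int) -> Optional[Node]:
    """Return the n-th (1-based) node in `window` matching role + label, or None."""
    matches = [e for e in nodes
               if e.window == window and e.role == role and e.label == label]
    return matches[n - 1] if 1 <= n <= len(matches) else None


NODES = [
    Node(1, "AXButton", "Send", 900, 300),
    Node(2, "AXButton", "Send", 900, 500),
    Node(3, "AXButton", "Send", 400, 700),
    Node(3, "AXButton", "Send", 900, 700),
]
```

A schema check passes for any of the four Send buttons above. Only `nth_in_window(NODES, 3, "AXButton", "Send", 2)` is the right answer, and that is the distinction a schema-conformance benchmark does not measure.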
Can any April 2026 open-weights model fully replace Claude Sonnet 4.6 for Mac agent work?
Not yet, at least not on sustained tool-call reliability across multi-minute agent runs. Qwen 3 32B in thinking mode is the closest practical match on raw reasoning. Gemma 4 is strong on instruction following but weaker on deep planning. Llama 4 Scout is surprisingly capable. DeepSeek R2 rivals frontier on reasoning. The honest read in April 2026 is that local, open-weights releases are viable for reflexive single-step queries and short agent loops, and frontier cloud is still the pick for sustained end-to-end Mac automation. Fazm's Custom API Endpoint lets the user run both: a cheap fast path and a strong hard path, against the same AX-tree payload.
How should I pick between the April 2026 open-source releases for my own Mac agent project?
Start with the input shape, not the leaderboard. If you send screenshots, compare vision SKUs on pixel-grounding latency and OCR fidelity. If you send structured accessibility-tree text (what Fazm does), favor text-first reasoning SKUs that handle 10K to 30K tokens of nested structured output reliably. Qwen 3 32B is the default pick, with Gemma 4 mid-size as a memory-constrained fallback and DeepSeek R2 through an API gateway when you want frontier reasoning without the frontier price.
Run your April 2026 open-weights pick against a real Mac
Download Fazm, open Settings, point Custom API Endpoint at your local Qwen 3, Gemma 4, Mistral Medium 3, or DeepSeek R2 proxy, and watch the AX-tree payload land in your model context.