
Local LLMs news, April 2026: every April release is context-starved, and one filter hides it

The top April 2026 local-LLMs roundups list context windows as spec-sheet bullets and never do the math. A 131K-context Qwen 3 or 128K Gemma 4 dies on turn two of a screenshot agent. Inside Fazm, a two-branch filter at acp-bridge/src/index.ts lines 2278-2291 silently drops every type:image item from MCP tool results, so every observation that actually reaches the model is text. A 500 KB PNG becomes 691 chars. That is the line that makes April 2026 local LLMs survive real desktop automation.

Fazm
12 min read
Every context-budget claim traced to a specific line in acp-bridge/src/index.ts
Covers all major April 2026 local-LLMs releases (Qwen 3, Gemma 4, Mistral Medium 3, Llama 4, DeepSeek R2, LM Studio + Locally AI)
Answers the question the top SERP results skip: does your favorite April open-weights SKU actually fit a 32K agent loop?

The April 2026 local-LLMs news cycle, at a glance

  • Qwen 3 (Apache 2.0, April 8)
  • Qwen3-Coder-Next
  • Gemma 4 (4 sizes, Apache 2.0)
  • Mistral Medium 3 (April 9)
  • Llama 4 Scout
  • Llama 4 Maverick
  • DeepSeek R2 (AIME 92.7%)
  • LM Studio acquires Locally AI (April 8)
  • Llama 3.3 70B (Latent Space top-local)
  • Qwen 2.5 32B (best coding)
  • Mistral Small 3.1 (best at 16 GB)
  • Nemotron Cascade 2
  • Kimi K2.5
  • MLX-LM on Apple Silicon
  • Ollama + llama.cpp + LM Studio, all OpenAI-flavored

The four numbers that actually decide whether an April 2026 local LLM can drive your Mac

~350K — tokens consumed by one 1920x1200 screenshot (base64 in prompt)
691 — characters in a Fazm browser_snapshot after the filter runs
2291 — line in acp-bridge/src/index.ts where the text-only branch closes
20 — MAX_IMAGE_TURNS, the per-session cap on deliberate Read() of screenshots

Every top April 2026 local-LLMs roundup ranks models by parameter count, Arena score, or Apache-license status. The numbers above are the ones that govern whether any of those models can actually hold an agent loop together on consumer hardware. The top result is a headline; the 691 is a line in a Fazm log.

500 KB → 691 chars

We extract only text items and skip images to keep context small.

acp-bridge/src/index.ts, line 2273 (comment above the two-branch filter)

The anchor fact: the filter is 21 lines, and it is the whole thing

Inside Fazm, every MCP tool result passes through the same handler. That handler has one job that matters for local LLMs: make sure nothing pixel-shaped ever reaches the next prompt. Two branches cover the two wire formats MCP servers can use (direct and ACP-wrapped), and a rawOutput fallback covers the pre-batched shape. None of them has a branch for type:image.

acp-bridge/src/index.ts, lines 2274-2307

Read it carefully. For each item in content[], two if-statements check whether something is text. Nothing checks for image. An item with type:image falls off the edge of both branches and never gets pushed onto texts[]. When the array is joined, images do not appear. When the prompt is assembled, images do not appear. The local LLM's context does not grow by 350K tokens. The agent keeps running.
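The branch logic described above can be sketched in a few lines. This is an illustrative reconstruction, not the actual Fazm source: the type and function names (`ContentItem`, `extractText`) are hypothetical, and only the branch conditions mirror what the article describes at lines 2278-2291.

```typescript
// Illustrative reconstruction of the two-branch text filter.
// Names here are hypothetical; the real code lives at
// acp-bridge/src/index.ts lines 2278-2291.
type ContentItem = {
  type: string;
  text?: string;
  content?: { type: string; text?: string };
};

function extractText(content: ContentItem[]): string {
  const texts: string[] = [];
  for (const item of content) {
    if (item.type === "text" && item.text) {
      // Branch 1: direct MCP format
      texts.push(item.text);
    } else if (item.content?.type === "text" && item.content.text) {
      // Branch 2: ACP-wrapped format (inner = item.content)
      texts.push(item.content.text);
    }
    // No branch for type:"image" — image items fall through
    // both conditions and are silently dropped.
  }
  return texts.join("\n");
}
```

Feed it a mixed array and the image item simply never appears in the joined output; that is the entire mechanism.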

Every April 2026 local LLM on the left, one context-safe stream in the middle, real Mac apps on the right

The interesting structural claim is that the left column is interchangeable and the middle column is not. Any April 2026 local model can be swapped in via an Anthropic-protocol shim; none of them can survive without the filter.

April 2026 local LLMs → Fazm filter (text-only) → your Mac

  • Left column (interchangeable models): Qwen 3 32B, Gemma 4 mid, Mistral Medium 3, Llama 4 Scout, DeepSeek R2
  • Hub: Fazm filter
  • Right column (real Mac apps): Safari, Mail, Xcode, Notes

The hub is the part nobody in the top SERP covers. Ship any model from the left column without the hub and a 32K to 131K context collapses inside a couple of turns. Ship the hub and the model choice is your problem, not the architecture's.

Same April 2026 local model, two observation shapes

This is the comparison the top SERP results almost never draw. The model is fixed; only the payload shape changes. The context math is what decides whether the loop survives.

A 131K-context local LLM, both ways

Every tool call returns a 1920x1200 PNG, roughly 500 KB base64, roughly 350K tokens when inlined in the prompt. Turn one fills the context. Turn two is already truncated. The model starts forgetting earlier tool calls before the session is a minute old. On a local Mac with a 32B model at thinking-mode speed, this is unworkable.

  • One observation saturates a 131K-context window
  • Multi-turn planning collapses on turn two
  • Vision-capable local SKU is required (large, slow)
  • No realistic path to a full Mac task loop
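The arithmetic behind those bullets is simple enough to write down. The per-screenshot and per-observation token counts below are the article's own estimates, not measurements; in practice the system prompt and conversation history consume most of the window, which is why the realistic filtered figure is "~40+ turns" rather than the raw quotient.

```typescript
// Back-of-envelope context budget, using the article's figures.
const CONTEXT_WINDOW = 131_000;        // e.g. a 131K-context Qwen 3 SKU
const TOKENS_PER_SCREENSHOT = 350_000; // ~500 KB base64 PNG inlined in prompt
const TOKENS_PER_FILTERED_OBS = 170;   // ~691-char YAML snapshot after the filter

// How many observations fit before the window saturates?
const turnsUnfiltered = Math.floor(CONTEXT_WINDOW / TOKENS_PER_SCREENSHOT);
const turnsFiltered = Math.floor(CONTEXT_WINDOW / TOKENS_PER_FILTERED_OBS);

console.log(turnsUnfiltered); // 0 — a single screenshot overflows the window
console.log(turnsFiltered);   // 770 — observations alone leave hundreds of turns
```

Zero complete screenshot observations fit in the window; the filtered shape leaves room for hundreds before anything else is counted.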

The lifecycle of one tool observation, from MCP response to model prompt

Six steps. Every number below is grep-able against acp-bridge/src/index.ts.


1. The browser or Mac-app observation fires

In the agent loop, a tool call like browser_snapshot or macos-use_refresh_traversal completes. The raw MCP response carries a content array with mixed text and (potentially) image items. Without intervention, the image items would travel straight into the next prompt.


2. The Playwright MCP flag strips inline images server-side (first layer)

acp-bridge/src/index.ts line 1033 pushes --output-mode file --image-responses omit --output-dir /tmp/playwright-mcp onto the Playwright MCP argv. Screenshots go to disk; the server is asked to not inline them.
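A sketch of what that argv push could look like. The flags and the output directory are the ones the article cites for line 1033; the variable name and surrounding structure are hypothetical.

```typescript
// Hypothetical sketch of the server-side layer: ask Playwright MCP to
// write screenshots to disk instead of inlining them in responses.
// Flags are those cited for acp-bridge/src/index.ts line 1033.
const playwrightArgs: string[] = ["npx", "@playwright/mcp"];
playwrightArgs.push(
  "--output-mode", "file",          // write artifacts to files
  "--image-responses", "omit",      // do not inline image content blocks
  "--output-dir", "/tmp/playwright-mcp", // where screenshots land on disk
);
```

This layer is advisory; as step 3 explains, the client-side filter is what actually guarantees no image reaches the prompt.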


3. The two-branch filter runs client-side (authoritative layer)

acp-bridge/src/index.ts lines 2278-2291 iterate content[] and push into a texts[] array only when item.type === 'text' (direct MCP format) OR inner.type === 'text' (ACP-wrapped format, where inner = item.content). Every non-text item is silently dropped.


4. The rawOutput fallback stays strictly text

If the content[] branch yields nothing, lines 2293-2307 extract only type:'text' items from rawOutput. Image items never reach the prompt.


5. The final text lands in session/update

The joined text string goes to the model as the tool_result. A typical browser_snapshot arrives at ~691 characters of YAML; a macos-use window traversal at roughly one line per element, 441 elements in ~0.72s, still fits comfortably in a 32K context window.


6. MAX_IMAGE_TURNS caps deliberate Read() calls

If the model explicitly asks to Read() a screenshot from disk, acp-bridge/src/index.ts line 793 (MAX_IMAGE_TURNS = 20) caps how many image-bearing turns a session may consume, so a local model's context cannot be killed by over-eager visual verification.
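The cap can be sketched as a simple counter. Only the constant name and its value (MAX_IMAGE_TURNS = 20, line 793) come from the article; the surrounding gate logic is illustrative.

```typescript
// Sketch of the per-session image-turn budget. The constant is from
// acp-bridge/src/index.ts line 793; the gate function is hypothetical.
const MAX_IMAGE_TURNS = 20;

let imageTurnsUsed = 0; // hypothetical per-session counter

function allowImageRead(): boolean {
  if (imageTurnsUsed >= MAX_IMAGE_TURNS) {
    return false; // budget exhausted: further screenshot Read()s are refused
  }
  imageTurnsUsed += 1;
  return true;
}
```

Twenty deliberate screenshot reads succeed; the twenty-first is refused, so a single session cannot saturate its own context with pixels.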

What happens when Qwen 3 asks for a browser snapshot

The model calls a tool, the Playwright MCP server writes a PNG to disk and streams back a content array, the Fazm bridge runs the filter, and the model sees only text.

Local LLM tool call → text-only observation back

Participants: Local LLM → Fazm bridge → Playwright MCP → Filter → Next prompt

  1. Local LLM emits tool_use: browser_snapshot
  2. Fazm bridge issues the MCP call (--image-responses omit)
  3. Playwright MCP returns content[] = [text YAML, optional image ref]
  4. Bridge hands content[] to the two-branch filter (lines 2278-2291)
  5. Filter pushes text items; silently drops type:image
  6. Surviving text joins into ~691 chars of YAML
  7. tool_result (text only, ~170 tokens) enters the next prompt

The same observation, two payload shapes, one local LLM

On the left, what a screenshot agent would hand a 131K-context local model for one turn. On the right, what Fazm's filter hands the same model for the same turn.

// tool_result content[] for browser_snapshot
[
  { type: "text", text: "Snapshot captured at /tmp/page.png" },
  {
    type: "image",
    source: {
      type: "base64",
      media_type: "image/png",
      data: "iVBORw0KGgoAAAANSUhEUgAAB4AAAA..."
      // ~500 KB of base64 → ~350K input tokens
      // 131K context: SATURATED on turn 1
    }
  }
]
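The right-hand side of the comparison is what the filter forwards for the same turn: a single text block. The shape and token figure match the article; the YAML body itself is illustrative, not a captured snapshot.

```typescript
// tool_result after the Fazm filter: text only, ~691 chars, ~170 tokens.
// The YAML content below is illustrative, not real captured output.
const toolResult = [
  {
    type: "text",
    text: [
      "- page: Example Domain",
      '  - heading "Example Domain" [level=1]',
      '  - link "More information..."',
      "# screenshot saved to /tmp/playwright-mcp/page.png (not inlined)",
    ].join("\n"),
  },
  // No image item: the ~500 KB base64 block never exists client-side.
];
```

Same observation, same model, two orders of magnitude fewer tokens per turn.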

The April 2026 local-LLMs lineup, rated for Mac agent work

Not a benchmark table. A task-fit read. Each card answers one question: with the Fazm filter in place, does this April 2026 SKU actually hold an agent loop together on a laptop-class Mac?

Qwen 3 family (Apache 2.0)

Released April 8, 2026. Sizes 0.6B to 72B, dual thinking / fast mode. Context windows up to 128K-ish in the top SKUs. Strong text reasoning. The practical Mac-agent pick at 32B once the filter runs.

Qwen3-Coder-Next

Community consensus for local coding in April 2026. Stable tool-call JSON. Strong fit for code-focused agent loops, less so for UI-heavy Mac automation.

Gemma 4 (4 variants, Apache 2.0)

Google's April drop. Small and mid-size run on consumer Macs. Good instruction following, weaker multi-step planning. Works for short agent loops.

Mistral Medium 3 (open weights)

April 9 release. Between small local and large proprietary. Strong on European languages. Pairs well with a strict tool-call validator.

Llama 4 Scout + Maverick (MoE)

Mixture-of-experts in mainstream open weights. Scout has a huge nominal context, but practical tool-use context is tighter. Quirky with nested JSON.

DeepSeek R2

AIME 92.7%, roughly 70% cheaper than frontier. Mostly API, not local-first. Reachable through the same Anthropic-protocol shim as the others.

LM Studio + Locally AI (April 8 news)

LM Studio acquired Locally AI on April 8, 2026. Local inference now has a native iOS / iPadOS runway. On the Mac side, LM Studio's server still speaks an OpenAI-flavored API; behind a ~200-line Anthropic shim, it fits the same Fazm endpoint slot.

Llama 3.3 70B (still the 'best overall' local)

Latent Space's April 2026 top-local list still crowns Llama 3.3 70B as best overall, Qwen 2.5 32B as best coding, Mistral Small 3.1 as the 16 GB RAM pick. Useful sanity check against the newer SKUs.

Verify the filter yourself

Three greps against acp-bridge/src/index.ts close the loop. Every line number below is real.

rg against acp-bridge/src/index.ts
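The three greps, spelled out. They assume a checkout that contains acp-bridge/src/index.ts; the search patterns are the ones the FAQ cites.

```shell
# The content-array extraction block: two branches plus the rawOutput
# fallback, roughly lines 2271-2307
rg -n "content as" acp-bridge/src/index.ts

# Line 1033: --image-responses omit passed to Playwright MCP
rg -n "image-responses" acp-bridge/src/index.ts

# Line 793: MAX_IMAGE_TURNS = 20
rg -n "MAX_IMAGE_TURNS" acp-bridge/src/index.ts
```

Three commands, three line numbers, and the whole context-discipline contract is verified.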

What every top April 2026 local-LLMs roundup misses

Reading llm-stats.com, Latent Space's top-local list, llm-explorer, Till Freitag's open-source comparison, BentoML's 2026 best-of, and promptquorum's local-LLMs guide back-to-back, the overlap is total and the gap is consistent. They all cover weights. None of them cover what happens after the weights are loaded.

The structural gap in every top SERP result

  • Lists April 2026 context windows as a spec bullet, never does the per-turn math
  • Treats the model as the agent, ignores the tool-observation pipeline that feeds it
  • Assumes vision-capable SKU is mandatory for desktop automation
  • Skips the existence of an application-layer filter between MCP server and prompt
  • Frames 'run it locally' as a question of hardware class, not payload shape
  • Mentions LM Studio + Locally AI as news, never as a substrate in a filtered pipeline

April 2026 local-LLMs pipeline, with and without the Fazm filter

Same model, same hardware, same MCP servers. Only the observation-pipeline policy changes.

Feature | Screenshot agent (no filter) | Fazm (filter on)
Tokens per tool observation | ~350K (1920x1200 PNG base64) | ~170 (691-char YAML)
Turns before 131K context saturates | ~1 | ~40+
Minimum usable local SKU class | 70B+ multimodal | 7B-32B text-first
Works on a 16 GB Mac | No | Yes (Gemma 4 small, Mistral Small 3.1)
Path to deliberate screenshot verification | Every turn, no escape | Read() on demand, capped at 20 turns
Compatible with April 2026 Qwen 3 / Gemma 4 / Mistral Medium 3 | Only largest SKUs, barely | Any text-first SKU in lineup
File where the policy lives | No such file | acp-bridge/src/index.ts lines 2271-2307
Depends on MCP server cooperation | Totally | No; client-side filter is authoritative

Run Qwen 3 or Gemma 4 against your real Mac workflows

20 minutes, your laptop, your local model. We'll wire it to Fazm and run through a real app loop.

Book a call

Frequently asked questions

What are the headline April 2026 local-LLMs stories worth knowing?

LM Studio announced the acquisition of Locally AI on April 8, 2026, which points the local-AI story toward on-device iOS and iPadOS and confirms that the mobile end of local inference is no longer optional. On the weights side, Qwen 3 (Apache 2.0, dual-mode thinking / fast) and Qwen3-Coder-Next are the community consensus picks for general reasoning and coding. Gemma 4 shipped in four Apache 2.0 sizes. Mistral Medium 3 landed with open weights to fill the mid-tier gap. Meta's Llama 4 Scout and Maverick brought mixture-of-experts into mainstream open weights. DeepSeek R2 clocked AIME 92.7% at roughly 70 percent less than frontier cloud. Latent Space's April 2026 top-local list pins Llama 3.3 70B, Qwen 2.5 32B, and Mistral Small 3.1 as the best practical picks by class.

Why does context window matter more than parameter count for a local Mac agent?

An agent does not get one shot; it runs a loop. Every tool call adds the latest observation to the prompt, and the model re-reads the whole trail each turn. A screenshot-based agent sends a PNG that tokenizes to roughly 350,000 input tokens per 1920x1200 frame (OpenAI's image tokenizer at base64-in-prompt rates). A 131K-context Qwen 3 truncates on turn one. A 128K Gemma 4 dies on turn two. Parameter count buys you better answers per token; context window buys you more turns before the agent forgets what it was doing. For desktop automation, turns matter more.

What is the exact line in Fazm that makes this problem disappear?

acp-bridge/src/index.ts lines 2278-2291. The handler for session/update iterates the content array attached to every MCP tool result and, for each item, matches two branches: item.type === 'text' (direct MCP format) and inner.type === 'text' where inner = item.content (the ACP-wrapped format). Items that are not text get no branch to land in, so they are silently dropped on the floor. There is also a fallback at lines 2293-2307 that extracts text from rawOutput and never touches type:'image' items. The net effect is that a browser_snapshot which would have been a 500 KB base64 PNG (~350K input tokens) becomes a 691-char YAML text (~170 tokens) before it ever reaches the model.

Why does the MCP --image-responses omit flag alone not solve this?

It tells the Playwright MCP server not to inline images, but it does not apply to native Mac-app observations captured through the macos-use MCP, and in extension mode with @playwright/mcp@0.0.68 the flag is partially ignored. The belt-and-suspenders fix is to set the flag at acp-bridge/src/index.ts line 1033 AND filter again on the Fazm side at lines 2271-2307. The filter is the authoritative layer; if someone upstream leaks an image item, the filter still drops it.

Which April 2026 local LLM is the actual best fit for a Mac agent?

A text-first reasoning SKU in the 7B to 32B band. Because Fazm's observation payload is structured accessibility-tree text, not pixels, you do not need a large multimodal local. Qwen 3 32B in thinking mode is the strongest general pick for reasoning; Qwen3-Coder-Next is the strongest for code-heavy loops. Gemma 4 mid-size variants work for shorter loops. Mistral Medium 3 is viable if you pair it with a strict tool-call validator in the shim. The headline numbers in the top April 2026 roundups skip the question entirely because they treat local LLMs as a benchmark exercise, not an agent substrate.

What does a real tool observation look like after the filter runs?

A Playwright browser_snapshot comes back as YAML with one line per interactive DOM node; a macos-use window traversal comes back as '[AXButton (button)] "Send" x:6272 y:-1754 w:56 h:28 visible' for each element (the macos-use binary walks through AXUIElementCreateApplication and typically emits ~441 elements in ~0.72 seconds). Both are plain UTF-8 text. Screenshots still get saved to disk (at /tmp/playwright-mcp for the browser, /tmp/macos-use for Mac apps) so the model can Read() a specific image with a deliberate tool call, but that only happens when the agent explicitly asks.

How does this interact with the other April 2026 local-LLMs news, specifically LM Studio and Locally AI?

LM Studio's April 8 acquisition of Locally AI signals that on-device, mobile, and Mac-local inference is getting a real UX owner. What that means in practice for desktop agents is that the server side (running Qwen 3 on localhost:11434 via Ollama, llama-server from llama.cpp, or MLX-LM behind LM Studio's runtime) is becoming a commodity. The hard part becomes what you feed the server. A screenshot loop feeds it pixels it cannot afford. A text-first accessibility-tree loop feeds it structured observations it can reason over for dozens of turns.

Does the filter hurt agent capability? What if the model actually needs to see a screenshot?

No. Screenshots still exist on disk. The filter only strips inline image items from tool results. If the model genuinely wants to see pixels (for example, to verify a visual diff), it calls the Read tool with the screenshot path (/tmp/playwright-mcp/page-2026-04-20T12-34-56.png). That Read call does emit an image content block. A per-session cap at acp-bridge/src/index.ts line 793, MAX_IMAGE_TURNS = 20, prevents any one session from blowing its own budget by reading screenshots in every turn.

Can I verify the two-branch filter myself without installing Fazm?

Yes. rg -n "content as" acp-bridge/src/index.ts locates the content-array extraction block; the surrounding lines 2271-2307 show the two branches and the rawOutput fallback. rg -n "image-responses" acp-bridge/src/index.ts gives you line 1033 where --image-responses omit is passed to Playwright MCP. rg -n "MAX_IMAGE_TURNS" gives line 793. Three greps, three numbers, the whole context-discipline contract.

What is the single biggest thing the top SERP roundups miss about April 2026 local LLMs?

They treat the model as the product. It is not. For desktop automation, the model is one interchangeable component; the observation payload shape is the thing that decides whether any local model in the April 2026 lineup can do the job. A screenshot agent with a 131K-context Qwen 3 is dead on turn two. An accessibility-tree agent with the same Qwen 3 runs dozens of turns. The observation is upstream of the weights, and almost every top-ten local-LLM article skips that part.

How does this interact with Fazm's Custom API Endpoint wiring?

Independently. The Custom API Endpoint (ANTHROPIC_BASE_URL, set at Desktop/Sources/Chat/ACPBridge.swift line 381) is what points Fazm at a local model. The image-drop filter (acp-bridge/src/index.ts lines 2271-2307) is what keeps that model's context from collapsing. Both are required together: the endpoint flag tells the agent where to talk, the filter decides how much it has to read per turn. Swap the endpoint and you change which weights answer; keep the filter and any April 2026 local LLM fits.

fazm.AI Computer Agent for macOS
© 2026 fazm. All rights reserved.
