Open source LLM 2026 / runtime layer

Open source LLM releases, 2026: the ten-event, nine-field runtime layer every recap leaves out.

Every 2026 open source LLM release gets benchmarked on parameter count, context window, and ELO. Llama 4 Scout 10M, Qwen 3 Apache 2.0 at 128K, Gemma 4 at 131K, DeepSeek V3.2, GLM-5.1 under MIT at 200K, Mistral Medium 3 open weights, Arcee Trinity 400B. What no roundup covers is what a consumer Mac agent has to do at runtime to make any of those models observable, per prompt, to a user sitting in front of a floating bar. That layer is a 267-line MIT patch against the stock @agentclientprotocol/claude-agent-acp entry point. It is the same file for every model the bridge speaks to, and it is the uncopyable thing this page is about.

Matthew Diakonov
11 min read

From the open source (MIT) desktop agent:

  • 267-line patched-acp-entry.mjs
  • 10 dropped ACP events re-forwarded
  • 5 usage fields + 4 _meta fields per prompt

Models covered in every 2026 roundup

Llama 4 Scout 10M, Llama 4 Maverick 400B, Qwen 3 Apache 2.0, Qwen 3 128K, Gemma 4 131K, Gemma 4 four variants, Mistral Medium 3 open weights, DeepSeek V3.2, DeepSeek 128K, GLM-5.1 MIT, GLM-5.1 200K, GLM-5.1 754B MoE, Arcee Trinity 400B, Arcee Apache 2.0, OpenRouter routing, Groq free tier, Together AI, Fireworks, self-hosted vLLM, Anthropic 0.29.2

What every top result for this keyword covers, and what every one of them misses

I searched the same terms. The first page is consistent: a table of 2026 open source releases, sorted by month, with four or five columns. Parameters, context window, license, provider stack, benchmark score. Llama 4 Scout ships with a 10M context window. Qwen 3 is Apache 2.0. GLM-5.1 is under MIT at 754B total with 200K context. Gemma 4 ships in four variants, all under Apache 2.0. DeepSeek V3.2 holds 128K. Mistral Medium 3 releases with open weights. Arcee Trinity targets enterprise self-hosting at 400B, Apache 2.0. Every roundup hits the same points.

None of them describes what happens inside a consumer agent when the user actually talks to one of those models. The model itself is only one of three things that shape a user session. The other two are the agent runtime (what the agent does with streaming events) and the bridge (what gets surfaced to the UI). Both of those are invisible in a spec-sheet roundup. Both of those are what Fazm actually ships.

The runtime layer for every model Fazm speaks to, open source or not, lives in one file. That file is the rest of this page.

267 lines in patched-acp-entry.mjs
10 dropped ACP event types re-forwarded
9 usage + _meta fields added per prompt
2 prototype methods monkey-patched

The patch, lifted verbatim.

This is the forwarder block that catches ten event types the stock ACP agent drops. Each branch is annotated with its exact line number in the MIT-licensed source, so you can open the file and read it yourself at acp-bridge/src/patched-acp-entry.mjs.

acp-bridge/src/patched-acp-entry.mjs
10 events

Every 2026 open source LLM roundup lists parameter count, context window, and benchmark score. None of them show the ten ACP session-update types a consumer agent has to re-forward to turn any of those models into a usable, observable runtime.

patched-acp-entry.mjs, lines 62 to 201
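If you are reading without the repo open, the forwarder pattern the file uses can be sketched like this. This is not the shipping patched-acp-entry.mjs; the helper name forwardDroppedEvents, the acpClient handle, and the exact payload shapes are assumptions based on the event list on this page.

```javascript
// Hedged sketch of the dropped-event forwarder pattern described on this
// page. Only four of the ten branches are shown; field names follow the
// descriptions above, not the shipping source.
function forwardDroppedEvents(item, acpClient, sessionId) {
  const forwarded = [];
  const send = (sessionUpdate, extra) => {
    const update = { sessionId, sessionUpdate, ...extra };
    forwarded.push(update);
    // re-emit as an ACP sessionUpdate so the UI can observe it
    acpClient?.sessionUpdate?.(update);
  };
  if (item.type === "system" && item.subtype === "compact_boundary") {
    // compaction finished: trigger is "auto" or "manual"
    send("compact_boundary", { trigger: item.trigger, preTokens: item.pre_tokens });
  } else if (item.type === "system" && item.subtype === "status") {
    send("status", { status: item.status }); // idle / thinking / tool / errored
  } else if (item.type === "rate_limit_event") {
    send("rate_limit", { resetsAt: item.rate_limit_info?.resetsAt });
  } else if (item.type === "tool_progress") {
    send("tool_progress", { elapsed: item.elapsed_time_seconds });
  }
  return forwarded; // the real file handles all ten event types
}
```

The real forwarder sits inside the wrapped query.next() loop, but the branch-and-re-emit shape is the whole trick.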

The ten dropped events, each pinned to a line.

You can open the file and read the comment next to each branch. The numbers below are not decorative; they are the actual line anchors in the shipping file, 267 lines total.

compact_boundary (line 65)

Emitted when the SDK finishes compacting an over-window context. Carries trigger (auto/manual) and pre_tokens. Forwarded so the UI can clear the compacting banner.

status (line 74)

Session status_change. Tells the UI whether the agent is idle, thinking, waiting on a tool, or errored. Stock ACP eats this.

task_started (line 79)

A long-running task has begun. Carries taskId and description. Pairs with task_notification updates.

task_notification (line 88)

Periodic progress update for an active task. Three fields: taskId, status, summary.

api_retry (line 98)

Backend retried after a 429, 500, 529, or network error. Surfaces HTTP status and error category plus attempt + max_retries.

rate_limit_event (line 132)

Provider rate-limit envelope. Eight fields including resetsAt, rateLimitType, utilization, isUsingOverage, surpassedThreshold.

tool_progress (line 156)

Tool call elapsed time in seconds. Lets the UI render "this tool has been running for 8.2s" without the call completing.

tool_use_summary (line 167)

Post-tool-batch summary with preceding tool_use_ids. Lets the UI coalesce a batch of tool calls into one line.

compaction_start (line 185)

Stream_event for content_block_start with type compaction. Fires when the model begins streaming a summary of old turns.

compaction_delta (line 191)

Each text chunk of the compaction summary, as it streams. Lets the UI render the compaction live instead of a spinner.

Ten dropped event types, one forwarder, one ACP client

compact_boundary, status, task_started, task_notification, api_retry, rate_limit_event, tool_progress, tool_use_summary, content_block_start (compaction), and compaction_delta all pass through patched-acp-entry.mjs, reach the Fazm floating bar via acpClient.sessionUpdate, and land in the Swift UI log.

The prompt() patch: nine fields every roundup skips.

The second patch, at lines 212 to 261, wraps the prompt() prototype method. The original result object gets five new usage fields and a four-field _meta block. The totalTokens formula at line 223 is not an estimate; it is the literal sum of the input, cache-write, cache-read, and output token counts the SDK returns.

acp-bridge/src/patched-acp-entry.mjs
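The augmentation step can be sketched as a pure function. The field names (usage, _meta, totalTokens, costUsd) come from this page; the helper name augmentPromptResult and the session bookkeeping fields other than _lastCostUsd are illustrative assumptions, not the shipping code.

```javascript
// Hedged sketch of the prompt() result augmentation described above.
// sdkUsage carries the Anthropic-style snake_case token counters; the
// return value adds the camelCase usage + _meta shape this page documents.
function augmentPromptResult(result, sdkUsage, session) {
  const input = sdkUsage.input_tokens ?? 0;
  const cacheWrite = sdkUsage.cache_creation_input_tokens ?? 0;
  const cacheRead = sdkUsage.cache_read_input_tokens ?? 0;
  const output = sdkUsage.output_tokens ?? 0;
  return {
    ...result, // original fields pass through untouched
    usage: {
      inputTokens: input,
      outputTokens: output,
      cachedWriteTokens: cacheWrite,
      cachedReadTokens: cacheRead,
      // the literal sum: input + cache-write + cache-read + output
      totalTokens: input + cacheWrite + cacheRead + output,
    },
    _meta: {
      costUsd: session._lastCostUsd ?? 0, // per-prompt delta, set during streaming
      terminalReason: session._terminalReason ?? null,
      lastApiError: session._lastApiError ?? null,
      errors: session._errors ?? [],
    },
  };
}
```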

Compaction, streaming, live.

The third piece is the compaction-stream forwarder at lines 181 to 201. When a 2026 open-source model with a 128K context (Qwen 3, Gemma 4, DeepSeek V3.2) saturates mid-session, the SDK streams a summary of the earlier turns as a sequence of content_block_delta chunks with delta.type === compaction_delta. The patch forwards every chunk as a sessionUpdate so the floating bar can render the summary live instead of showing a spinner.

acp-bridge/src/patched-acp-entry.mjs
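The compaction branch can be sketched in isolation. The event shapes (content_block_start with a compaction block, then content_block_delta chunks with delta.type === "compaction_delta") follow this page's description; the helper name forwardCompactionEvent and the exact SDK field nesting are assumptions.

```javascript
// Hedged sketch of the compaction-stream forwarding described above.
// `send` stands in for whatever emits an ACP sessionUpdate to the UI.
function forwardCompactionEvent(streamEvent, send) {
  const ev = streamEvent.event ?? streamEvent;
  if (ev.type === "content_block_start" && ev.content_block?.type === "compaction") {
    // the model has started streaming a summary of old turns
    send({ sessionUpdate: "compaction_start" });
    return true;
  }
  if (ev.type === "content_block_delta" && ev.delta?.type === "compaction_delta") {
    // forward each summary chunk so the UI can render it live, not a spinner
    send({ sessionUpdate: "compaction_delta", text: ev.delta.text });
    return true;
  }
  return false; // not a compaction event; other branches handle it
}
```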

What the user sees during a long Llama 4 Scout session against an OpenRouter backend

The UI shows a spinner. Mid-session the spinner stays up for 14 seconds, then 32, then 48. The user does not know whether the agent is thinking, compacting the context, rate-limited by OpenRouter, or hung. At the end of the turn no per-prompt cost is reported. Cache-read vs cache-write tokens are never exposed. tool_progress events with elapsed_time_seconds are silently dropped. The terminal_reason on session exit is unavailable.

  • Spinner is the only signal
  • No per-prompt USD cost
  • No cache-read / cache-write split
  • Rate-limit events dropped
  • Compaction streams dropped
  • Tool elapsed time dropped

One prompt through the patched bridge, step by step

1

User selects an open source model (Llama 4 Scout via OpenRouter)

Fazm calls acpRequest("session/set_model", { sessionId, modelId }) at acp-bridge/src/index.ts line 1495 after the ACP session is created.

2

User sends a prompt. ClaudeAcpAgent.prompt() is invoked.

The patched version at lines 212 to 261 calls the original first, then inspects session._lastCostUsd which the wrapped query.next() populated during streaming.

3

Stream arrives. query.next() returns items one by one.

The patch at line 39 wrapped query.next(). Each item is inspected at line 43 (result type), line 62 (system type), line 132 (rate_limit_event), line 156 (tool_progress).

4

A rate_limit_event arrives from OpenRouter's backend

The line 132 forwarder packs rate_limit_info into a sessionUpdate with eight fields including resetsAt and utilization.

5

A compaction content_block_start arrives mid-stream

Line 185 forwards sessionUpdate: compaction_start. Subsequent compaction_delta chunks at line 191 stream the summary text live.

6

The terminal result type arrives with total_cost_usd

Line 46 computes the delta since the previous prompt: session._lastCostUsd = item.value.total_cost_usd - prevSessionCost.

7

prompt() returns. The patch at line 234 builds the augmented object.

usage carries 5 token fields (totalTokens = input + cacheWrite + cacheRead + output at line 223). _meta carries costUsd + terminalReason + lastApiError + errors.

8

Swift UI consumes both the sessionUpdate stream and the augmented prompt return

Floating bar renders "compacting context…", per-prompt USD cost, cache-read ratio, rate-limit countdown, and elapsed tool time. None of those is available without this patch.

Where each of the ten events enters the system

User → Fazm Swift UI → acp-bridge → Claude Agent SDK → open-source backend:

1. User clicks send on a prompt; the Swift UI issues session/prompt over ACP.
2. The bridge calls originalPrompt(params); the SDK POSTs /v1/messages to the backend (Llama 4 Scout, Qwen 3, etc.).
3. stream_event: content_block_start type=compaction is forwarded at line 185 as sessionUpdate compaction_start.
4. rate_limit_event with 8 fields is forwarded at line 132 as sessionUpdate rate_limit.
5. tool_progress elapsed_time_seconds=8.2 is forwarded at line 156 as sessionUpdate tool_progress.
6. result with total_cost_usd + usage arrives; prompt() augments the return with usage + _meta (line 234).
7. The sessionUpdate stream plus the augmented return reach the floating bar: $0.0042, cache 61%, compacting, rate-limit 2m18s.

Grep it yourself.

Every claim on this page is a grep away from verification. Here are the four commands.

acp-bridge verification
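Collected in one place, the verification commands from the FAQ below look like this. The first three (wc and two greps) are the ones this page states; the fourth grep is a reasonable guess at a check on the totalTokens block, not lifted from the repo.

```shell
# Verify the patched entry file's claims; run from a clone of
# github.com/mediar-ai/fazm.
f=acp-bridge/src/patched-acp-entry.mjs
if [ -f "$f" ]; then
  wc -l "$f"                 # expect 267
  grep -n sessionUpdate "$f" # the forwarded session-update types
  grep -n _lastCostUsd "$f"  # expect hits at lines 45, 216, 222 and 250
  grep -n totalTokens "$f"   # hypothetical check on the usage sum at line 223
else
  echo "clone github.com/mediar-ai/fazm first"
fi
```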

Stock Claude Agent SDK vs the Fazm runtime layer

Stock @agentclientprotocol/claude-agent-acp ^0.29.2 vs Fazm (patched-acp-entry.mjs, 267 lines, MIT):

  • Per-prompt USD cost shown to user. Stock: not surfaced. Fazm: _meta.costUsd from the prompt() patch, lines 212 to 261.
  • Cache-read vs cache-write token split. Stock: not exposed. Fazm: usage.cachedReadTokens and cachedWriteTokens at lines 239 to 240.
  • Rate-limit countdown with resetsAt. Stock: dropped silently. Fazm: rate_limit_event forwarder at line 132, eight fields.
  • HTTP status + error category on retry. Stock: generic retry counter only. Fazm: api_retry forwarder at lines 98 to 125 with errorStatus + errorType.
  • Compaction progress in real time. Stock: session appears hung. Fazm: compaction_start at line 185, compaction_delta at line 191.
  • Tool elapsed time per call. Stock: not emitted. Fazm: tool_progress forwarder at line 156 with elapsed_time_seconds.
  • Task lifecycle (started + notification). Stock: dropped. Fazm: task_started at line 79, task_notification at line 88.
  • Structured terminal_reason on exit. Stock: unavailable. Fazm: _meta.terminalReason captured at line 53.
  • Works for any model the SDK speaks to. Stock: closed to Anthropic models. Fazm: model-agnostic; the bridge wraps the ClaudeAcpAgent prototype.

Ten independently grep-verifiable claims

  • File exists: acp-bridge/src/patched-acp-entry.mjs, 267 lines (wc -l)
  • Build script copies it to dist: tsc && cp src/patched-acp-entry.mjs dist/patched-acp-entry.mjs
  • compact_boundary forwarder at line 65
  • status forwarder at line 74
  • task_started at line 79, task_notification at line 88
  • api_retry at line 98 with HTTP status + error category
  • rate_limit_event at line 132 with 8 fields including resetsAt
  • tool_progress at line 156, tool_use_summary at line 167
  • compaction content_block_start at line 185, compaction_delta at line 191
  • prompt() patch at lines 212 to 261: 5 usage + 4 _meta fields, totalTokens formula at line 223

Why this is the spine of the product.

Fazm's differentiator is that it runs on real macOS accessibility APIs instead of screenshot loops, and it works with any Mac app, not just the browser. But the model that consumes those AX trees is replaceable. Today it ships with claude-sonnet-4-6 as the default (acp-bridge/src/index.ts line 1245). Tomorrow, if someone wires Llama 4 Scout through an OpenRouter provider path, the AX tree flows into the same prompt(), the same streaming pipeline, the same patched-acp-entry.mjs.

The 267 lines are what make the swap safe. If you change the model tomorrow to a 200K GLM-5.1, the cost meter still works because _meta.costUsd is sourced from the SDK result. The rate-limit countdown still works because rate_limit_event has a normalized shape. The compacting-context banner still works because a stream_event with content_block.type === compaction still arrives. The user experience survives the model swap because the observability layer lives below the model boundary.

That is the part of the story that every 2026 open source LLM release recap leaves out, and the part a shipping consumer desktop agent cannot skip.

See the ten-event forwarder running live against any 2026 open source LLM.

15-minute demo: patched-acp-entry.mjs in a live Fazm session, switching between Llama 4, Qwen 3, and Sonnet without touching the UI.

Book a call

Frequently asked questions

What are the major open source LLM releases in 2026?

The shortlist every roundup is tracking through April 2026: Meta Llama 4 Scout (MoE with a 10M context window) and Llama 4 Maverick (MoE, 400B total parameters), Alibaba Qwen 3 under Apache 2.0 with 128K context, Google Gemma 4 across four variants under Apache 2.0 with 131K context, Mistral Medium 3 with open weights, DeepSeek V3.2 at 128K, Zhipu GLM-5.1 under MIT at 200K and 754B MoE, and Arcee Trinity as a 400B Apache 2.0 model for enterprise self-hosting. Every roundup covers the same four columns: parameter count, context window, license, benchmark score.

What do the top open source LLM 2026 roundups miss?

They describe releases as a spec-sheet table. They do not describe what a desktop agent does at runtime when it speaks to those models through the Claude Agent SDK. Specifically: the default @agentclientprotocol/claude-agent-acp entry point drops ten kinds of session events on the floor (compaction, rate limits, tool progress, API retries, task lifecycle, status change) and never surfaces per-prompt cost in USD or cache-read/cache-write token splits. That is what shapes the actual user experience for any of the 2026 models, not the benchmark score.

Where in the Fazm source is the open source runtime observability layer?

One file: /Users/matthewdi/fazm/acp-bridge/src/patched-acp-entry.mjs. 267 lines total (wc -l). It imports ClaudeAcpAgent from @agentclientprotocol/claude-agent-acp and monkey-patches two prototype methods: createSession at line 20 (wraps query.next() so dropped stream events get re-forwarded as ACP sessionUpdates) and prompt at line 211 (augments the return value with usage and _meta). The package version is pinned at acp-bridge/package.json line 15 to @agentclientprotocol/claude-agent-acp ^0.29.2.

Which specific session events does the patch re-forward?

Ten, at specific line numbers you can open and verify: compact_boundary at line 65, status at line 74, task_started at line 79, task_notification at line 88, api_retry at line 98 (with the HTTP status code and error category, plus attempt and retry_delay_ms), rate_limit_event at line 132 (eight fields including resetsAt, rateLimitType, utilization, overageStatus, isUsingOverage, surpassedThreshold), tool_progress at line 156 (with elapsed_time_seconds), tool_use_summary at line 167, compaction content_block_start at line 185, and compaction_delta at line 191. The stock ACP agent drops all ten on the floor.

What does the prompt() patch actually surface?

Lines 212 to 261. For every prompt that returns successfully, the patch augments the result with a usage object carrying five fields: inputTokens, outputTokens, cachedReadTokens, cachedWriteTokens, totalTokens. The totalTokens formula at line 223 is input_tokens + cache_creation_input_tokens + cache_read_input_tokens + output_tokens. It also adds a _meta object with four fields: costUsd (the delta since the last prompt, computed at line 46 as total_cost_usd - prevSessionCost), terminalReason, lastApiError (the last captured HTTP status + error type), and errors (any structured errors from the SDK result).

Why does this matter more for open source LLMs than for Anthropic models?

It matters for both, but the signal-to-noise jump is bigger for open source models because the provider stack is heterogeneous. OpenRouter, Together AI, Fireworks, Groq, Lambda, and self-hosted vLLM each emit rate-limit information and retry-after headers in a slightly different shape, which the Claude Agent SDK normalizes into the rate_limit_event and api_retry types. Without the patch at lines 98 to 125 and 132 to 152, that normalization is useless; the events just get thrown away. With the patch, the Fazm floating bar can say "Groq rate-limited you, resets in 42s" on a Llama 4 Scout session the same way it would on a Sonnet session.

How do I verify the ten-event + nine-field claim?

Clone the Fazm app (the desktop source lives in /Users/matthewdi/fazm, the repo at github.com/mediar-ai/fazm). Then run: wc -l acp-bridge/src/patched-acp-entry.mjs (expect 267), grep -n sessionUpdate acp-bridge/src/patched-acp-entry.mjs (you will see exactly the sessionUpdate types listed in this page), grep -n _lastCostUsd acp-bridge/src/patched-acp-entry.mjs (expect three hits at lines 45, 216, 222 and the cleanup at 250). The usage field names and the totalTokens formula are visible in one continuous block at lines 217 to 223.

How is the patch invoked in the shipping Fazm binary?

acp-bridge/package.json's build script is tsc && cp src/patched-acp-entry.mjs dist/patched-acp-entry.mjs, so the mjs ships alongside the compiled TypeScript. The bundled acp-bridge process calls runAcp() at line 264 after installing the two prototype patches, so every Claude Agent SDK session the bridge spawns inherits them. There is nothing conditional; the patch is on for every model selected via session/set_model, including the default claude-sonnet-4-6 set at acp-bridge/src/index.ts line 1245 and any open-source model the user configures via the provider-selection path.

Does the patch drop the stock ACP events, or does it run on top of them?

It runs on top. Line 38 captures the original query.next reference before overwriting it at line 39, and the patched implementation calls originalNext first at line 40, then inspects the returned item, forwards new sessionUpdates for the ten dropped event types, and finally returns the same item to the caller. The prompt() patch at line 213 is the same pattern: it calls originalPrompt first, then augments the return. If the patch had a bug, the original behavior would still execute.
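The capture-then-call-original pattern the answer describes can be sketched generically. The helper name patchNext and the onItem callback are illustrative; only the ordering (original first, inspect second, return the same item) is taken from this page.

```javascript
// Hedged sketch of the "call the original first" wrapping pattern.
// A bug in the inspector cannot change what the caller receives.
function patchNext(query, onItem) {
  const originalNext = query.next.bind(query); // capture before overwriting
  query.next = async function (...args) {
    const item = await originalNext(...args); // original behavior always runs
    try {
      onItem(item); // inspect and forward dropped events
    } catch {
      // forwarding failures must never break the stream itself
    }
    return item; // the same item an unpatched caller would have received
  };
}
```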

What does the user see that would not exist without this patch?

Four concrete things in the floating bar UI during a long Fazm session against any LLM, open source or closed. One: per-prompt USD cost to four decimals, sourced from _meta.costUsd. Two: cache-read vs cache-write token splits, sourced from usage.cachedReadTokens and usage.cachedWriteTokens. Three: a rate-limit countdown (for example "Resets in 2m 18s") when a provider sends a rate_limit_event, sourced from the line 132 forwarder with the resetsAt field. Four: "compacting context…" during long sessions, sourced from the line 185 and 191 forwarders.

What about api_retry? Why is the HTTP status code specifically called out?

Because in heterogeneous open source deployments (a local vLLM at home, a Groq free tier while traveling, a Together API key for Llama 4 Scout, an OpenRouter fallback), the difference between a 429 (rate limited, wait and retry), a 500 (backend broken, try a different provider), and a 529 (Anthropic overloaded, specific to one vendor) changes what the user or the agent should do next. The line 98 to 125 forwarder surfaces errorStatus and errorType both to the UI log and to the ACP session update stream, so downstream consumers can route on those values instead of guessing from a generic retry counter.

Is this something I can adopt in my own Claude Agent SDK project?

Yes. patched-acp-entry.mjs is 267 lines of plain ES module code that monkey-patches two well-defined prototype methods on ClaudeAcpAgent. The Fazm repo is MIT-licensed at github.com/mediar-ai/fazm, so you can lift the file, adapt the sessionUpdate types to whatever your UI consumes, and point your ACP agent binary at it instead of the default entry. The only SDK assumption is the ^0.29.2 minor version pinned at package.json line 15, which is the version that shipped in mid-April 2026 alongside the compact_boundary and rate_limit_event shapes the patch consumes.

fazm.AI Computer Agent for macOS
© 2026 fazm. All rights reserved.
