Open source LLM 2026 / runtime layer

Open source LLM releases, 2026: the ten-event, nine-field runtime layer every recap leaves out.

Every 2026 open source LLM release gets benchmarked on parameter count, context window, and ELO. Llama 4 Scout 10M, Qwen 3 Apache 2.0 at 128K, Gemma 4 at 131K, DeepSeek V3.2, GLM-5.1 under MIT at 200K, Mistral Medium 3 open weights, Arcee Trinity 400B. What no roundup covers is what a consumer Mac agent has to do at runtime to make any of those models observable, per prompt, to a user sitting in front of a floating bar. That layer is a 267-line MIT patch against the stock @agentclientprotocol/claude-agent-acp entry point. It is the same file for every model the bridge speaks to, and it is the uncopyable thing this page is about.

Matthew Diakonov
11 min read

From the open source (MIT) desktop agent:

  • 267-line patched-acp-entry.mjs
  • 10 dropped ACP events re-forwarded
  • 5 usage fields + 4 _meta fields per prompt

Models covered in every 2026 roundup

Llama 4 Scout 10M, Llama 4 Maverick 400B, Qwen 3 Apache 2.0, Qwen 3 128K, Gemma 4 131K, Gemma 4 four variants, Mistral Medium 3 open weights, DeepSeek V3.2, DeepSeek 128K, GLM-5.1 MIT, GLM-5.1 200K, GLM-5.1 754B MoE, Arcee Trinity 400B, Arcee Apache 2.0, OpenRouter routing, Groq free tier, Together AI, Fireworks, self-hosted vLLM, Anthropic 0.29.2

What every top result for this keyword covers, and what every one of them misses

I searched the same terms. The first page is consistent: a table of 2026 open source releases, sorted by month, with four or five columns. Parameters, context window, license, provider stack, benchmark score. Llama 4 Scout ships with a 10M context window. Qwen 3 is Apache 2.0. GLM-5.1 is under MIT at 754B total with 200K context. Gemma 4 ships in four variants, all under Apache 2.0. DeepSeek V3.2 holds 128K. Mistral Medium 3 releases with open weights. Arcee Trinity targets enterprise self-hosting at 400B, Apache 2.0. Every roundup hits the same points.

None of them describes what happens inside a consumer agent when the user actually talks to one of those models. The model itself is only one of three things that shape a user session. The other two are the agent runtime (what the agent does with streaming events) and the bridge (what gets surfaced to the UI). Both of those are invisible in a spec-sheet roundup. Both of those are what Fazm actually ships.

The runtime layer for every model Fazm speaks to, open source or not, lives in one file. That file is the rest of this page.

267 lines in patched-acp-entry.mjs
10 dropped ACP event types re-forwarded
9 usage + _meta fields added per prompt
2 prototype methods monkey-patched

The patch, lifted verbatim.

This is the forwarder block that catches ten event types the stock ACP agent drops. Each branch is annotated with its exact line number in the MIT-licensed source, so you can open the file and read it yourself at acp-bridge/src/patched-acp-entry.mjs.

acp-bridge/src/patched-acp-entry.mjs
10 events

Every 2026 open source LLM roundup lists parameter count, context window, and benchmark score. None of them show the ten ACP session-update types a consumer agent has to re-forward to turn any of those models into a usable, observable runtime.

patched-acp-entry.mjs, lines 62 to 201
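If you are reading without the repo open, the forwarder pattern the file uses can be sketched like this. This is not the shipping patched-acp-entry.mjs; the helper name forwardDroppedEvents, the acpClient handle, and the exact payload shapes are assumptions based on the event list on this page.

```javascript
// Hedged sketch of the dropped-event forwarder pattern described on this
// page. Only four of the ten branches are shown; field names follow the
// descriptions above, not the shipping source.
function forwardDroppedEvents(item, acpClient, sessionId) {
  const forwarded = [];
  const send = (sessionUpdate, extra) => {
    const update = { sessionId, sessionUpdate, ...extra };
    forwarded.push(update);
    // re-emit as an ACP sessionUpdate so the UI can observe it
    acpClient?.sessionUpdate?.(update);
  };
  if (item.type === "system" && item.subtype === "compact_boundary") {
    // compaction finished: trigger is "auto" or "manual"
    send("compact_boundary", { trigger: item.trigger, preTokens: item.pre_tokens });
  } else if (item.type === "system" && item.subtype === "status") {
    send("status", { status: item.status }); // idle / thinking / tool / errored
  } else if (item.type === "rate_limit_event") {
    send("rate_limit", { resetsAt: item.rate_limit_info?.resetsAt });
  } else if (item.type === "tool_progress") {
    send("tool_progress", { elapsed: item.elapsed_time_seconds });
  }
  return forwarded; // the real file handles all ten event types
}
```

The real forwarder sits inside the wrapped query.next() loop, but the branch-and-re-emit shape is the whole trick.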

The ten dropped events, each pinned to a line.

You can open the file and read the comment next to each branch. The numbers below are not decorative; they are the actual line anchors in the shipping file, 267 lines total.

compact_boundary (line 65)

Emitted when the SDK finishes compacting an over-window context. Carries trigger (auto/manual) and pre_tokens. Forwarded so the UI can clear the compacting banner.

status (line 74)

Session status_change. Tells the UI whether the agent is idle, thinking, waiting on a tool, or errored. Stock ACP eats this.

task_started (line 79)

A long-running task has begun. Carries taskId and description. Pairs with task_notification updates.

task_notification (line 88)

Periodic progress update for an active task. Three fields: taskId, status, summary.

api_retry (line 98)

Backend retried after a 429, 500, 529, or network error. Surfaces HTTP status and error category plus attempt + max_retries.

rate_limit_event (line 132)

Provider rate-limit envelope. Eight fields including resetsAt, rateLimitType, utilization, isUsingOverage, surpassedThreshold.

tool_progress (line 156)

Tool call elapsed time in seconds. Lets the UI render "this tool has been running for 8.2s" without the call completing.

tool_use_summary (line 167)

Post-tool-batch summary with preceding tool_use_ids. Lets the UI coalesce a batch of tool calls into one line.

compaction_start (line 185)

Stream_event for content_block_start with type compaction. Fires when the model begins streaming a summary of old turns.

compaction_delta (line 191)

Each text chunk of the compaction summary, as it streams. Lets the UI render the compaction live instead of a spinner.

Ten dropped event types, one forwarder, one ACP client

compact_boundary, status, task_started, task_notification, api_retry, rate_limit_event, tool_progress, tool_use_summary, content_block_start (compaction), and compaction_delta all pass through patched-acp-entry.mjs, reach the Fazm floating bar via acpClient.sessionUpdate, and land in the Swift UI log.

The prompt() patch: nine fields every roundup skips.

The second patch, at lines 212 to 261, wraps the prompt() prototype method. The original result object gets five new usage fields and a four-field _meta block. The totalTokens formula at line 223 is not an estimate; it is the literal sum of the input, cache-write, cache-read, and output token counts the SDK returns.

acp-bridge/src/patched-acp-entry.mjs
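The augmentation step can be sketched as a pure function. The field names (usage, _meta, totalTokens, costUsd) come from this page; the helper name augmentPromptResult and the session bookkeeping fields other than _lastCostUsd are illustrative assumptions, not the shipping code.

```javascript
// Hedged sketch of the prompt() result augmentation described above.
// sdkUsage carries the Anthropic-style snake_case token counters; the
// return value adds the camelCase usage + _meta shape this page documents.
function augmentPromptResult(result, sdkUsage, session) {
  const input = sdkUsage.input_tokens ?? 0;
  const cacheWrite = sdkUsage.cache_creation_input_tokens ?? 0;
  const cacheRead = sdkUsage.cache_read_input_tokens ?? 0;
  const output = sdkUsage.output_tokens ?? 0;
  return {
    ...result, // original fields pass through untouched
    usage: {
      inputTokens: input,
      outputTokens: output,
      cachedWriteTokens: cacheWrite,
      cachedReadTokens: cacheRead,
      // the literal sum: input + cache-write + cache-read + output
      totalTokens: input + cacheWrite + cacheRead + output,
    },
    _meta: {
      costUsd: session._lastCostUsd ?? 0, // per-prompt delta, set during streaming
      terminalReason: session._terminalReason ?? null,
      lastApiError: session._lastApiError ?? null,
      errors: session._errors ?? [],
    },
  };
}
```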

Compaction, streaming, live.

The third piece is the compaction-stream forwarder at lines 181 to 201. When a 2026 open-source model with a 128K context (Qwen 3, Gemma 4, DeepSeek V3.2) saturates mid-session, the SDK streams a summary of the earlier turns as a sequence of content_block_delta chunks with delta.type === compaction_delta. The patch forwards every chunk as a sessionUpdate so the floating bar can render the summary live instead of showing a spinner.

acp-bridge/src/patched-acp-entry.mjs
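The compaction branch can be sketched in isolation. The event shapes (content_block_start with a compaction block, then content_block_delta chunks with delta.type === "compaction_delta") follow this page's description; the helper name forwardCompactionEvent and the exact SDK field nesting are assumptions.

```javascript
// Hedged sketch of the compaction-stream forwarding described above.
// `send` stands in for whatever emits an ACP sessionUpdate to the UI.
function forwardCompactionEvent(streamEvent, send) {
  const ev = streamEvent.event ?? streamEvent;
  if (ev.type === "content_block_start" && ev.content_block?.type === "compaction") {
    // the model has started streaming a summary of old turns
    send({ sessionUpdate: "compaction_start" });
    return true;
  }
  if (ev.type === "content_block_delta" && ev.delta?.type === "compaction_delta") {
    // forward each summary chunk so the UI can render it live, not a spinner
    send({ sessionUpdate: "compaction_delta", text: ev.delta.text });
    return true;
  }
  return false; // not a compaction event; other branches handle it
}
```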

What the user sees during a long Llama 4 Scout session against an OpenRouter backend

The UI shows a spinner. Mid-session the spinner stays up for 14 seconds, then 32, then 48. The user does not know whether the agent is thinking, compacting the context, rate-limited by OpenRouter, or hung. At the end of the turn no per-prompt cost is reported. Cache-read vs cache-write tokens are never exposed. tool_progress events with elapsed_time_seconds are silently dropped. The terminal_reason on session exit is unavailable.

  • Spinner is the only signal
  • No per-prompt USD cost
  • No cache-read / cache-write split
  • Rate-limit events dropped
  • Compaction streams dropped
  • Tool elapsed time dropped

One prompt through the patched bridge, step by step

1

User selects an open source model (Llama 4 Scout via OpenRouter)

Fazm calls acpRequest("session/set_model", { sessionId, modelId }) at acp-bridge/src/index.ts line 1495 after the ACP session is created.

2

User sends a prompt. ClaudeAcpAgent.prompt() is invoked.

The patched version at lines 212 to 261 calls the original first, then inspects session._lastCostUsd which the wrapped query.next() populated during streaming.

3

Stream arrives. query.next() returns items one by one.

The patch at line 39 wrapped query.next(). Each item is inspected at line 43 (result type), line 62 (system type), line 132 (rate_limit_event), line 156 (tool_progress).

4

A rate_limit_event arrives from OpenRouter's backend

The line 132 forwarder packs rate_limit_info into a sessionUpdate with eight fields including resetsAt and utilization.

5

A compaction content_block_start arrives mid-stream

Line 185 forwards sessionUpdate: compaction_start. Subsequent compaction_delta chunks at line 191 stream the summary text live.

6

The terminal result type arrives with total_cost_usd

Line 46 computes the delta since the previous prompt: session._lastCostUsd = item.value.total_cost_usd - prevSessionCost.

7

prompt() returns. The patch at line 234 builds the augmented object.

usage carries 5 token fields (totalTokens = input + cacheWrite + cacheRead + output at line 223). _meta carries costUsd + terminalReason + lastApiError + errors.

8

Swift UI consumes both the sessionUpdate stream and the augmented prompt return

Floating bar renders "compacting context…", per-prompt USD cost, cache-read ratio, rate-limit countdown, and elapsed tool time. None of those is available without this patch.

Where each of the ten events enters the system

User → Fazm Swift UI → acp-bridge → Claude Agent SDK → open-source backend:

1. User clicks send on a prompt; the Swift UI issues session/prompt over ACP.
2. The bridge calls originalPrompt(params); the SDK POSTs /v1/messages to the backend (Llama 4 Scout, Qwen 3, etc.).
3. stream_event: content_block_start type=compaction is forwarded at line 185 as sessionUpdate compaction_start.
4. rate_limit_event with 8 fields is forwarded at line 132 as sessionUpdate rate_limit.
5. tool_progress elapsed_time_seconds=8.2 is forwarded at line 156 as sessionUpdate tool_progress.
6. result with total_cost_usd + usage arrives; prompt() augments the return with usage + _meta (line 234).
7. The sessionUpdate stream plus the augmented return reach the floating bar: $0.0042, cache 61%, compacting, rate-limit 2m18s.

Grep it yourself.

Every claim on this page is a grep away from verification. Here are the four commands.

acp-bridge verification
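Collected in one place, the verification commands from the FAQ below look like this. The first three (wc and two greps) are the ones this page states; the fourth grep is a reasonable guess at a check on the totalTokens block, not lifted from the repo.

```shell
# Verify the patched entry file's claims; run from a clone of
# github.com/mediar-ai/fazm.
f=acp-bridge/src/patched-acp-entry.mjs
if [ -f "$f" ]; then
  wc -l "$f"                 # expect 267
  grep -n sessionUpdate "$f" # the forwarded session-update types
  grep -n _lastCostUsd "$f"  # expect hits at lines 45, 216, 222 and 250
  grep -n totalTokens "$f"   # hypothetical check on the usage sum at line 223
else
  echo "clone github.com/mediar-ai/fazm first"
fi
```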

Stock Claude Agent SDK vs the Fazm runtime layer

Stock @agentclientprotocol/claude-agent-acp ^0.29.2 vs Fazm (patched-acp-entry.mjs, 267 lines, MIT):

  • Per-prompt USD cost shown to user. Stock: not surfaced. Fazm: _meta.costUsd from the prompt() patch, lines 212 to 261.
  • Cache-read vs cache-write token split. Stock: not exposed. Fazm: usage.cachedReadTokens and cachedWriteTokens at lines 239 to 240.
  • Rate-limit countdown with resetsAt. Stock: dropped silently. Fazm: rate_limit_event forwarder at line 132, eight fields.
  • HTTP status + error category on retry. Stock: generic retry counter only. Fazm: api_retry forwarder at lines 98 to 125 with errorStatus + errorType.
  • Compaction progress in real time. Stock: session appears hung. Fazm: compaction_start at line 185, compaction_delta at line 191.
  • Tool elapsed time per call. Stock: not emitted. Fazm: tool_progress forwarder at line 156 with elapsed_time_seconds.
  • Task lifecycle (started + notification). Stock: dropped. Fazm: task_started at line 79, task_notification at line 88.
  • Structured terminal_reason on exit. Stock: unavailable. Fazm: _meta.terminalReason captured at line 53.
  • Works for any model the SDK speaks to. Stock: closed to Anthropic models. Fazm: model-agnostic; the bridge wraps the ClaudeAcpAgent prototype.

Ten independently grep-verifiable claims

  • File exists: acp-bridge/src/patched-acp-entry.mjs, 267 lines (wc -l)
  • Build script copies it to dist: tsc && cp src/patched-acp-entry.mjs dist/patched-acp-entry.mjs
  • compact_boundary forwarder at line 65
  • status forwarder at line 74
  • task_started at line 79, task_notification at line 88
  • api_retry at line 98 with HTTP status + error category
  • rate_limit_event at line 132 with 8 fields including resetsAt
  • tool_progress at line 156, tool_use_summary at line 167
  • compaction content_block_start at line 185, compaction_delta at line 191
  • prompt() patch at lines 212 to 261: 5 usage + 4 _meta fields, totalTokens formula at line 223

Why this is the spine of the product.

Fazm's differentiator is that it runs on real macOS accessibility APIs instead of screenshot loops, and it works with any Mac app, not just the browser. But the model that consumes those AX trees is replaceable. Today it ships with claude-sonnet-4-6 as the default (acp-bridge/src/index.ts line 1245). Tomorrow, if someone wires Llama 4 Scout through an OpenRouter provider path, the AX tree flows into the same prompt(), the same streaming pipeline, the same patched-acp-entry.mjs.

The 267 lines are what make the swap safe. If you change the model tomorrow to a 200K GLM-5.1, the cost meter still works because _meta.costUsd is sourced from the SDK result. The rate-limit countdown still works because rate_limit_event has a normalized shape. The compacting-context banner still works because a stream_event with content_block.type === compaction still arrives. The user experience survives the model swap because the observability layer lives below the model boundary.

That is the part of the story that every 2026 open source LLM release recap leaves out, and the part a shipping consumer desktop agent cannot skip.

See the ten-event forwarder running live against any 2026 open source LLM.

15-minute demo: patched-acp-entry.mjs in a live Fazm session, switching between Llama 4, Qwen 3, and Sonnet without touching the UI.

Book a call

Frequently asked questions

What are the major open source LLM releases in 2026?

The shortlist every roundup is tracking through April 2026: Meta Llama 4 Scout (MoE with a 10M context window) and Llama 4 Maverick (MoE, 400B total parameters), Alibaba Qwen 3 under Apache 2.0 with 128K context, Google Gemma 4 across four variants under Apache 2.0 with 131K context, Mistral Medium 3 with open weights, DeepSeek V3.2 at 128K, Zhipu GLM-5.1 under MIT at 200K and 754B MoE, and Arcee Trinity as a 400B Apache 2.0 model for enterprise self-hosting. Every roundup covers the same four columns: parameter count, context window, license, benchmark score.

What do the top open source LLM 2026 roundups miss?

They describe releases as a spec-sheet table. They do not describe what a desktop agent does at runtime when it speaks to those models through the Claude Agent SDK. Specifically: the default @agentclientprotocol/claude-agent-acp entry point drops ten kinds of session events on the floor (compaction, rate limits, tool progress, API retries, task lifecycle, status change) and never surfaces per-prompt cost in USD or cache-read/cache-write token splits. That is what shapes the actual user experience for any of the 2026 models, not the benchmark score.

Where in the Fazm source is the open source runtime observability layer?

One file: /Users/matthewdi/fazm/acp-bridge/src/patched-acp-entry.mjs. 267 lines total (wc -l). It imports ClaudeAcpAgent from @agentclientprotocol/claude-agent-acp and monkey-patches two prototype methods: createSession at line 20 (wraps query.next() so dropped stream events get re-forwarded as ACP sessionUpdates) and prompt at line 211 (augments the return value with usage and _meta). The package version is pinned at acp-bridge/package.json line 15 to @agentclientprotocol/claude-agent-acp ^0.29.2.

Which specific session events does the patch re-forward?

Ten, at specific line numbers you can open and verify: compact_boundary at line 65, status at line 74, task_started at line 79, task_notification at line 88, api_retry at line 98 (with the HTTP status code and error category, plus attempt and retry_delay_ms), rate_limit_event at line 132 (eight fields including resetsAt, rateLimitType, utilization, overageStatus, isUsingOverage, surpassedThreshold), tool_progress at line 156 (with elapsed_time_seconds), tool_use_summary at line 167, compaction content_block_start at line 185, and compaction_delta at line 191. The stock ACP agent drops all ten on the floor.

What does the prompt() patch actually surface?

Lines 212 to 261. For every prompt that returns successfully, the patch augments the result with a usage object carrying five fields: inputTokens, outputTokens, cachedReadTokens, cachedWriteTokens, totalTokens. The totalTokens formula at line 223 is input_tokens + cache_creation_input_tokens + cache_read_input_tokens + output_tokens. It also adds a _meta object with four fields: costUsd (the delta since the last prompt, computed at line 46 as total_cost_usd - prevSessionCost), terminalReason, lastApiError (the last captured HTTP status + error type), and errors (any structured errors from the SDK result).

Why does this matter more for open source LLMs than for Anthropic models?

It matters for both, but the signal-to-noise jump is bigger for open source models because the provider stack is heterogeneous. OpenRouter, Together AI, Fireworks, Groq, Lambda, and self-hosted vLLM each emit rate-limit information and retry-after headers in a slightly different shape, which the Claude Agent SDK normalizes into the rate_limit_event and api_retry types. Without the patch at lines 98 to 125 and 132 to 152, that normalization is useless; the events just get thrown away. With the patch, the Fazm floating bar can say "Groq rate-limited you, resets in 42s" on a Llama 4 Scout session the same way it would on a Sonnet session.

How do I verify the ten-event + nine-field claim?

Clone the Fazm app (the desktop source lives in /Users/matthewdi/fazm, the repo at github.com/mediar-ai/fazm). Then run: wc -l acp-bridge/src/patched-acp-entry.mjs (expect 267), grep -n sessionUpdate acp-bridge/src/patched-acp-entry.mjs (you will see exactly the sessionUpdate types listed in this page), grep -n _lastCostUsd acp-bridge/src/patched-acp-entry.mjs (expect three hits at lines 45, 216, 222 and the cleanup at 250). The usage field names and the totalTokens formula are visible in one continuous block at lines 217 to 223.

How is the patch invoked in the shipping Fazm binary?

acp-bridge/package.json's build script is tsc && cp src/patched-acp-entry.mjs dist/patched-acp-entry.mjs, so the mjs ships alongside the compiled TypeScript. The bundled acp-bridge process calls runAcp() at line 264 after installing the two prototype patches, so every Claude Agent SDK session the bridge spawns inherits them. There is nothing conditional; the patch is on for every model selected via session/set_model, including the default claude-sonnet-4-6 set at acp-bridge/src/index.ts line 1245 and any open-source model the user configures via the provider-selection path.

Does the patch drop the stock ACP events, or does it run on top of them?

It runs on top. Line 38 captures the original query.next reference before overwriting it at line 39, and the patched implementation calls originalNext first at line 40, then inspects the returned item, forwards new sessionUpdates for the ten dropped event types, and finally returns the same item to the caller. The prompt() patch at line 213 is the same pattern: it calls originalPrompt first, then augments the return. If the patch had a bug, the original behavior would still execute.
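The capture-then-call-original pattern the answer describes can be sketched generically. The helper name patchNext and the onItem callback are illustrative; only the ordering (original first, inspect second, return the same item) is taken from this page.

```javascript
// Hedged sketch of the "call the original first" wrapping pattern.
// A bug in the inspector cannot change what the caller receives.
function patchNext(query, onItem) {
  const originalNext = query.next.bind(query); // capture before overwriting
  query.next = async function (...args) {
    const item = await originalNext(...args); // original behavior always runs
    try {
      onItem(item); // inspect and forward dropped events
    } catch {
      // forwarding failures must never break the stream itself
    }
    return item; // the same item an unpatched caller would have received
  };
}
```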

What does the user see that would not exist without this patch?

Four concrete things in the floating bar UI during a long Fazm session against any LLM, open source or closed. One: per-prompt USD cost to four decimals, sourced from _meta.costUsd. Two: cache-read vs cache-write token splits, sourced from usage.cachedReadTokens and usage.cachedWriteTokens. Three: a rate-limit countdown (for example "Resets in 2m 18s") when a provider sends a rate_limit_event, sourced from the line 132 forwarder with the resetsAt field. Four: "compacting context…" during long sessions, sourced from the line 185 and 191 forwarders.

What about api_retry? Why is the HTTP status code specifically called out?

Because in heterogeneous open source deployments (a local vLLM at home, a Groq free tier while traveling, a Together API key for Llama 4 Scout, an OpenRouter fallback), the difference between a 429 (rate limited, wait and retry), a 500 (backend broken, try a different provider), and a 529 (Anthropic overloaded, specific to one vendor) changes what the user or the agent should do next. The line 98 to 125 forwarder surfaces errorStatus and errorType both to the UI log and to the ACP session update stream, so downstream consumers can route on those values instead of guessing from a generic retry counter.

Is this something I can adopt in my own Claude Agent SDK project?

Yes. patched-acp-entry.mjs is 267 lines of plain ES module code that monkey-patches two well-defined prototype methods on ClaudeAcpAgent. The Fazm repo is MIT-licensed at github.com/mediar-ai/fazm, so you can lift the file, adapt the sessionUpdate types to whatever your UI consumes, and point your ACP agent binary at it instead of the default entry. The only SDK assumption is the ^0.29.2 minor version pinned at package.json line 15, which is the version that shipped in mid-April 2026 alongside the compact_boundary and rate_limit_event shapes the patch consumes.

fazm.AI Computer Agent for macOS
© 2026 fazm. All rights reserved.
