Notes from inside a shipping consumer Mac agent

A 2026 launch-day benchmark for a new LLM, written from the point of view of a desktop agent that already has to live with it.

As of May 2026 the four public benchmarks that survive every frontier launch are GPQA Diamond, Humanity's Last Exam, SWE-Bench Verified, and LiveCodeBench. They tell you how smart the new model is on a clean prompt. They do not tell you what happens to a shipping desktop agent on the day the new model lands, which is the only thing a small business owner whose invoicing flow already routes through Claude actually cares about. This page proposes a four-axis integration benchmark for the launch-day case and shows the seven-line Swift function that scored the Opus 4.7 GA rename on April 22, 2026.

Matthew Diakonov
9 min read

Direct answer (verified 2026-05-11)

The benchmarks that track new LLM launches in 2026 are GPQA Diamond, Humanity's Last Exam, SWE-Bench Verified, and LiveCodeBench, with Arena Elo (LMSYS) as the human-preference aggregate. They are what Vellum, llm-stats.com, BenchLM, Siliconflow, and clickrank rank against. If you are shipping a desktop agent rather than reading a leaderboard, the launch-day number that matters more than any of those is the four-axis integration cost described in the section below: alias stability, tool-payload shape, persisted-preference migration cost, and binary-update requirement. The Opus 4.7 GA on April 22, 2026 failed axis one, held axis two, cost seven lines of Swift on axis three, and waived axis four.

The benchmarks every 2026 launch is scored on, and the one nobody scores

Pick any frontier launch from this year. Claude Opus 4.7 on April 22. Gemini 3.1 Pro through the spring. GPT-5.5. Llama 4 Scout and Maverick. The pages that rank for them all use roughly the same five numbers. GPQA Diamond for graduate-level science Q&A. Humanity's Last Exam for the hardest aggregate hold-out. SWE-Bench Verified for real GitHub issues. LiveCodeBench for contamination-resistant competitive programming. Arena Elo on top as the human-preference aggregate. Five numbers, a chart, a recommendation.

They are honest numbers. None of them are wrong. They are the right numbers if your job is to read a leaderboard and pick the smartest model for a clean chat prompt. They are the wrong numbers if your job is to keep a shipping agent working on the day the new model lands.

A desktop agent is the model plus an SDK plus a bridge subprocess plus a persisted user preference plus a UI label. Every frontier launch in 2026 has moved at least one of those parts. The integration-cost number lives in those moves, not in the reasoning score.

What every leaderboard ranks on

Public benchmarks that score the 2026 launches

GPQA Diamond

Graduate-level science Q&A held out of training dumps; the standard frontier-reasoning floor in 2026.

Humanity's Last Exam

Broad expert-curated holdout; the hardest aggregate score on most 2026 leaderboards.

SWE-Bench Verified

Real GitHub issues with verified pass/fail. Closest public proxy for agent coding capability.

LiveCodeBench

Rolling competitive-programming problems collected after each model's cutoff. Contamination-resistant.

Arena Elo

Human-preference aggregate from LMSYS. Sits on top of the others as the rough north star.

Integration cost (missing)

The four-axis launch-day number a shipping agent actually pays. Not on any public board as of May 2026.

The integration-cost number, on four axes

What follows is a candidate launch-day benchmark for a shipping agent. It is the four-axis number Fazm actually tracks per frontier release, written down so a future agent owner can run the same test on a model that launches next month. Each axis either holds or fails as a binary; migration cost is a count of lines. A sketch of the record one launch produces follows the list.

Alias stability

Does the SDK report the new tier under the same identifier substring as last time? Opus 4.7 GA on 2026-04-22: failed (opus -> default).

Tool-payload shape

Do existing tool-use frames still deserialise without a schema bump? Across all four 2026 launches Fazm has eaten so far: held.

Migration cost

Lines of code to map yesterday's stored selection to today's available list. Opus 4.7: seven lines of Swift. Sonnet 4.6: zero.

Binary-update need

Can the new model appear in the picker without an App Store push? Yes, since Fazm 2.4.0 moved the picker to a dynamic list (2026-04-20).
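That record, sketched as a type for concreteness. The type and field names are illustrative, not Fazm source; only the Opus 4.7 values come from the text above.

```swift
// Hypothetical record of one launch-day integration score.
// Names are my own; the values below restate the Opus 4.7 GA result.
struct LaunchIntegrationScore {
    let modelLaunch: String          // e.g. "Opus 4.7 GA, 2026-04-22"
    let aliasStable: Bool            // same identifier substring as last release?
    let toolPayloadHeld: Bool        // existing tool-use frames still deserialise?
    let migrationLines: Int          // lines of code to map the stored preference to the new list
    let binaryUpdateRequired: Bool   // does the new model need an App Store push to appear?
}

let opus47 = LaunchIntegrationScore(
    modelLaunch: "Opus 4.7 GA, 2026-04-22",
    aliasStable: false,          // opus -> default rename
    toolPayloadHeld: true,       // no schema bump
    migrationLines: 7,           // normalizeModelId
    binaryUpdateRequired: false  // picker already on a dynamic list
)
```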

Public benchmark vs. integration-cost benchmark

What it scores. Public benchmark: reasoning, coding, science Q&A, human preference on chat-shaped prompts. Integration-cost benchmark: whether a shipping agent absorbs the launch without breaking persisted user state.

Unit of measurement. Public benchmark: accuracy, pass-rate, Elo points. Integration-cost benchmark: alias stability, tool-payload diff, lines of migration code, binary-update requirement.

Source of truth. Public benchmark: held-out evaluation sets, leaderboard runs. Integration-cost benchmark: the launch-day diff between the SDK's new identifier list and the user's stored preference.

What it predicts about launch day. Public benchmark: which model will be the smartest, given a clean prompt. Integration-cost benchmark: whether your users will see the right pill, with no settings dialog, on the next launch.

What it costs to run. Public benchmark: GPU hours per model per benchmark. Integration-cost benchmark: one JSON-RPC frame, one substring map, one log line per launch.

How often it should run. Public benchmark: per release, often once per leaderboard. Integration-cost benchmark: per launch and per SDK minor version, because the rename can ride on either.

April 22, 2026: Opus 4.7 GA against the four axes

Take the densest launch of the year so far and walk it through. On April 22, 2026, Anthropic flipped Claude Opus 4.7 to general availability. GPQA went up. SWE-Bench went up. Arena Elo went up. Every public page that ranks 2026 launches updated to reflect the new top of the Opus tier.

Inside the @anthropic-ai/claude-agent-sdk, the Opus tier stopped being reported under any identifier containing the substring "opus". The new identifier was the literal alias 'default'. That is the design choice that lets Anthropic swap underlying weights from 4.5 to 4.6 to 4.7 to whatever ships next without forcing every consumer app to push a release per swap. It is also the design choice that costs every shipping agent a migration.

On the four axes: alias stability failed, because the SDK now reports the tier under a different substring than the prior week. Tool-payload shape held; nothing in the agent-bridge protocol changed. Migration cost was seven lines of Swift. Binary-update requirement was waived, because Fazm 2.4.0 had shipped two days earlier with the model picker moved off a static array and onto the SDK's live list.

Launch day in real time


T+0s. SDK reports the new identifier list.

The acp-bridge subprocess at acp-bridge/src/index.ts receives models_available from the @anthropic-ai/claude-agent-sdk. On 2026-04-22 the Opus tier arrived under the alias 'default' for the first time. The bridge does not assume; it forwards the literal SDK payload to the Swift side.

The path the launch-day frame actually takes through a shipping desktop agent

Reading top to bottom: the SDK reports the new identifier list, the bridge forwards it, the Swift handler diffs against the persisted user preference, the substring rewrite fires, the new value lands back in UserDefaults. Six hops, five actors, one log line. The reason the user never has to repick the model after a frontier launch is that this round trip completes inside the first two seconds of the next app launch.
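The handler side of that round trip, as a hedged sketch rather than Fazm source. updateModels, the shortcut_selectedModel key, and the log line are named in this piece; the class shape, the stand-in normalize call, and every other detail are assumptions.

```swift
import Foundation

// Hedged sketch of the Swift side of the models_available round trip.
final class ShortcutSettingsSketch {
    private(set) var availableModels: [String] = []

    // Stand-in for the seven-line normalizeModelId rewrite described below.
    private func normalizeModelId(_ id: String) -> String {
        id.contains("opus") ? "default" : id
    }

    func updateModels(_ acpModels: [String]) {
        availableModels = acpModels   // stand-in for recomputeAvailableModels
        let key = "shortcut_selectedModel"
        let stored = UserDefaults.standard.string(forKey: key) ?? ""
        guard !availableModels.contains(stored) else { return }   // persisted selection still valid
        let migrated = normalizeModelId(stored)                    // e.g. 'opus' -> 'default'
        if availableModels.contains(migrated) {
            UserDefaults.standard.set(migrated, forKey: key)
            print("ShortcutSettings: normalized selectedModel to \(migrated)")
        }
    }
}
```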

models_available frame, opus -> default migration

Anthropic SDK -> acp-bridge -> ChatProvider -> ShortcutSettings -> UserDefaults. The models_available[default] frame crosses as JSON-RPC, updateModels(acpModels) runs, the stored shortcut_selectedModel reads back 'opus' (legacy), normalizeModelId rewrites it to 'default', and 'default' is written back and persisted. Seven lines of migration code, end to end.

Fixed Smart (Opus) model preference not persisting after app update, now correctly maps stored 'opus' to the new ACP model ID.

Fazm 2.4.2 changelog, 2026-04-26

The anchor fact: a doc comment that names the cause and a function that is its receipt

Open the file at /Users/matthewdi/fazm/Desktop/Sources/FloatingControlBar/ShortcutSettings.swift and scroll to line 180. The function is named normalizeModelId(_:). The body is three guarded substring checks. The third branch is the only one that matters for the April 22 launch: it returns "default" for any input that contains the substring "opus". The doc comment directly above reads, verbatim, 'ACP SDK v0.29+ uses default for Opus 4.7; migrate stored opus to match.'
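Reconstructed from the description here and in the FAQ below, the body plausibly looks like this; the haiku and sonnet return values and the exact bracket-suffix capture are guesses, not a copy of the file.

```swift
enum ModelIdMigration {
    /// ACP SDK v0.29+ uses default for Opus 4.7; migrate stored opus to match.
    static func normalizeModelId(_ modelId: String) -> String {
        // Keep a context-variant suffix like "[1m]" if one is present.
        let bracketSuffix = modelId.firstIndex(of: "[").map { String(modelId[$0...]) } ?? ""
        if modelId.contains("haiku") { return "haiku" + bracketSuffix }
        if modelId.contains("sonnet") { return "sonnet" + bracketSuffix }
        if modelId.contains("opus") { return "default" + bracketSuffix }
        return modelId
    }
}
```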

That is the entire migration. The changelog entry that documents it, a single line in CHANGELOG.json for version 2.4.2 dated 2026-04-26, is one sentence: 'Fixed Smart (Opus) model preference not persisting after app update, now correctly maps stored opus to the new ACP model ID.' One sentence in the user-facing changelog, seven lines in the source. That is the integration cost of the densest single-day frontier launch of 2026, measured at the level of a shipping consumer Mac agent.

The reason the function is seven lines and not seventy is the modelFamilyMap at lines 159 to 164. Four rows. The last two share an order index and a display label. Once the substring rewrite lands at "default", the picker renders Smart on the same pill it has always rendered, because the map says default belongs to the Opus tier with display order 2, identical to the legacy opus row above it.
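A sketch of the shape those four rows take; the field names are mine, and the row values are the ones quoted in the FAQ below.

```swift
// Four-row family map: last two rows share a pill and an order index,
// which is why a rename inside the Opus tier does not reorder the picker.
let modelFamilyMap: [(idSubstring: String, pill: String, family: String, order: Int)] = [
    ("haiku",   "Scary", "Haiku",  0),
    ("sonnet",  "Fast",  "Sonnet", 1),
    ("opus",    "Smart", "Opus",   2),
    ("default", "Smart", "Opus",   2)
]
```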

The full list of identifiers a shipping agent has seen in 2026

The strings below are everything the bridge has reported across the April and May 2026 launches Fazm has eaten. They include two Anthropic tier renames, one context-variant suffix, and three OpenAI effort suffixes. None of them required an App Store push to land in the picker.

haiku-4-5, sonnet-4-6, sonnet-4-6[1m], opus-4-5, opus-4-6, opus-4-7, default, gpt-5.4/high, gpt-5.5/high, gpt-5.5-codex/high
7
Lines of Swift to migrate Opus 4.7 (normalizeModelId)

Three substring branches plus a bracket-suffix capture.

~2 s
SDK frame to pill update, roughly

Three hops: bridge, ChatProvider, ShortcutSettings. One log line on the way out.

4
Rows in modelFamilyMap

haiku, sonnet, opus, default. Last two share a pill, which is why the rename is invisible.

What this benchmark is, and what it is not

The integration-cost number above is not a replacement for GPQA or SWE-Bench. It is not a claim that reasoning quality is irrelevant. A model that costs zero lines of migration but reasons worse is still a worse model. The argument is narrower: for a shipping agent, the integration test runs first, and a launch that costs you 200 lines of migration code on top of a 5-point GPQA gain is a worse launch than the headlines say it is.

Two things follow. First, public leaderboards have room for a sixth column. Alias stability across the launch is a reportable yes-or-no per frontier release; tool-payload diff is a reportable count; migration cost in lines of code is reportable if anyone runs a reference agent against the release. Vellum and llm-stats.com could publish those numbers next to the GPQA column without changing anything else about their pages.

Second, if you are building an agent and not a leaderboard, the only place to run this benchmark today is your own code. Move the model picker off a static array, then watch what the SDK does the next time a frontier model ships. The line count of your migration function on the day after will tell you more about whether your stack survived the launch than any public number on any leaderboard will.
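One way to rehearse that launch day before it happens is to exercise the migration path directly: persist the legacy alias, feed the handler a post-launch identifier list, and assert the rewrite fires. A minimal sketch of that check, assuming a throwaway defaults suite and the substring rewrite from earlier; none of the names below come from a real test file.

```swift
import XCTest

// Hedged sketch of the persisted-preference half of the launch-day test.
final class LaunchDayMigrationTests: XCTestCase {
    // Same stand-in rewrite used in the sketches above.
    private func normalize(_ id: String) -> String {
        id.contains("opus") ? "default" : id
    }

    func testStoredOpusMigratesWithoutASettingsDialog() {
        let defaults = UserDefaults(suiteName: "launch-day-test")!
        defaults.set("opus", forKey: "shortcut_selectedModel")

        // Simulate the post-launch identifier list the SDK reports.
        let reported = ["haiku-4-5", "sonnet-4-6", "default"]
        let stored = defaults.string(forKey: "shortcut_selectedModel")!
        let migrated = normalize(stored)

        XCTAssertEqual(migrated, "default")
        XCTAssertTrue(reported.contains(migrated), "picker should resolve with no settings dialog")
    }
}
```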

Want to see the substring map in the wild?

Walk through Fazm with us. Bring the workflow you most want to keep working across the next frontier launch.

Frequently asked questions

Which benchmarks actually track new LLM launches in 2026?

The four that survived the 2025 to 2026 transition without going stale or getting saturated are GPQA Diamond (graduate-level science Q&A, resistant to training-set contamination), Humanity's Last Exam (a broad expert-curated set held out from public dumps), SWE-Bench Verified (real GitHub issues with verified test pass/fail), and LiveCodeBench (rolling competitive-programming problems collected after each model's cutoff). Arena Elo from LMSYS sits on top as the human-preference aggregate. Vellum, llm-stats.com, BenchLM, Siliconflow, and clickrank all rank frontier launches on some subset of those numbers. The pages do their job: they tell you which model is currently smartest at clean prompt-in, prompt-out reasoning.

What do those benchmarks miss for a shipping AI agent on launch day?

They score the model in isolation. A desktop agent is the model plus an SDK plus a bridge subprocess plus a persisted user preference plus a UI label. The thing that actually breaks on launch day is the seam between those parts. When Anthropic flipped Claude Opus 4.7 to GA on April 22, 2026, GPQA went up and SWE-Bench went up; the @anthropic-ai/claude-agent-sdk also renamed the Opus tier from a string containing 'opus' to the literal alias 'default'. Every shipping desktop agent that stored 'opus' in user defaults silently fell back to the next-best model that did exist in the new list. None of the four benchmarks above predict that, because none of them measure it.

What is the integration-cost number, and how is it measured for a desktop agent?

It is a four-axis launch-day number Fazm tracks per frontier release: alias stability (does the SDK report the new tier under the same identifier substring as last time?), tool-payload shape (do existing tool-use frames still deserialise, or has a field been renamed?), persisted-preference migration cost (how many lines of code does it take to map yesterday's stored value to today's available list?), and binary-update requirement (can the new model appear in the picker without an App Store push?). For the Opus 4.7 GA on April 22, 2026, alias stability failed, tool-payload shape held, migration cost was seven lines of Swift, and binary-update requirement was waived because Fazm 2.4.0 had already moved the picker to a dynamic list two days earlier.

Where is the seven-line function that absorbed the Opus 4.7 alias rename?

It is a static function called normalizeModelId(_:) at /Users/matthewdi/fazm/Desktop/Sources/FloatingControlBar/ShortcutSettings.swift lines 180 to 191. The body is three guarded substring checks. The first two map the legacy haiku/sonnet strings to their short aliases. The third branch, the one the 2.4.2 changelog refers to, is 'if modelId.contains("opus") { return "default" + bracketSuffix }'. The doc comment above the function reads verbatim 'ACP SDK v0.29+ uses default for Opus 4.7; migrate stored opus to match.' Seven lines, three substring branches, one preserved bracket suffix for context variants. That is the entire migration.

How is the new model list actually delivered to the app, and why does this matter for benchmarking?

Fazm 2.4.0 (shipped April 20, 2026, two days before Opus 4.7 GA) moved the model picker off a static array baked into the binary and onto a JSON-RPC frame named models_available emitted by the acp-bridge subprocess. The flow is three hops. The bridge listens to whatever the ACP SDK currently reports. The bridge emits models_available to the Swift app over the agent connection. The Swift handler at ChatProvider.swift line 1016 routes the frame into ShortcutSettings.updateModels, which calls recomputeAvailableModels and, if the persisted selection no longer matches a row, falls back through normalizeModelId and then a substring upgrade. The benchmark-relevant point is that this is the only architecture that absorbs a frontier launch without a binary release. Anything else has to ship a build per frontier model.

What model identifiers has the bridge actually seen across 2026 launches?

The bridge's DEFAULT_MODEL constant at acp-bridge/src/index.ts line 1920 is currently 'claude-sonnet-4-6'. The dynamic list reported by the SDK across April and May 2026 has included haiku-4-5, sonnet-4-6, sonnet-4-6[1m] (the 1M context variant), opus-4-5, opus-4-6, opus-4-7, and the post-rename 'default' alias. The Fazm modelFamilyMap at ShortcutSettings.swift lines 159 to 164 has four rows: ('haiku', 'Scary', 'Haiku', 0), ('sonnet', 'Fast', 'Sonnet', 1), ('opus', 'Smart', 'Opus', 2), ('default', 'Smart', 'Opus', 2). The last two rows share an order index, which is why a launch-day rename inside the Opus tier does not reorder the picker pills.

How would you benchmark a brand-new frontier model from a desktop-agent perspective in 2026?

Run the four-axis integration test before you run the reasoning test. First, attach the model to the agent bridge under whatever identifier the SDK reports, then send a tool-heavy prompt and inspect the tool-use frames for shape drift against the previous version. Second, write the stored preference 'old_alias', restart the app, and verify the picker migrates it without a settings dialog. Third, count the lines of code your migration required and add that to the launch's score. Fourth, only then run GPQA, SWE-Bench, LiveCodeBench, and Humanity's Last Exam on the model the bridge actually routes to. A model that scores 5 points higher on GPQA but costs you 200 lines of bridge code is a regression for a shipping agent.

Does this matter for any other agent shape, or is it specific to consumer desktop apps?

It matters anywhere the model identifier is persisted across a process boundary. A scheduled-task runner that resumes after a crash and re-reads the user's preferred model has the same exposure. A CRM automation that stores 'opus' in a row of a Postgres table has the same exposure. A browser extension that caches the picked model in chrome.storage has the same exposure. The reason it shows up most loudly on a desktop agent is that the time between SDK rename and user-visible breakage is roughly two seconds (the time from app launch to the first models_available frame). On a server-side stack you usually find out hours later when alerts fire on the wrong dashboard.

Why does Anthropic rename aliases at all? Wouldn't pinning to versioned IDs be safer?

Inside the agent SDK, 'default' is a stable handle that always resolves to whichever Opus tier is currently the highest-quality production model on the account. Renaming the live alias lets Anthropic swap underlying weights from Opus 4.5 to 4.6 to 4.7 to whatever ships next without forcing every consumer app to push a release per swap. The cost of that decision is borne once by every downstream app that holds a versioned legacy ID. The migration function above is the receipt: write the rewrite once, and every future swap inside the same tier is invisible to your users.

Where can a small business owner actually see this play out, without reading the source?

Open the floating bar with Cmd+Shift+Space and look at the pill on the right. It says Scary, Fast, or Smart. Behind those three words is a substring map that absorbed at least one frontier launch in April 2026 without anyone clicking anything. The reason you never had to repick the model after the Opus 4.7 GA is that the seven-line function rewrote your stored preference on the next launch, logged a line that reads 'ShortcutSettings: normalized selectedModel to default', and moved on. That is the integration test passing in the wild.
