Latest AI model releases, papers, and open-source projects (May 25 to 26, 2026)

Two days, two centers of gravity. May 25 was a training-efficiency day at the paper layer (SkillOpt, Lens, the diffusion-transformer routing paper) and a guardrails-and-browser-mode day at the agent layer (Fazm v2.9.37). May 26 shifted into agent-and-world-model benchmarks at the paper layer (TriSplat, DVAO, WBench, Macaron-A2UI, CUA-Gym, Claw-Anything) and a structural refactor day at the agent layer: the standalone Claude OAuth window deleted at 11:33 PDT, the Connect Claude button deleted at 19:59 PDT, v2.9.41 tagged in between. This is the on-disk record of all of it.

Matthew Diakonov
Direct answer (verified 2026-05-28)

May 25, 2026 trending papers (huggingface.co/papers/date/2026-05-25): SkillOpt (arxiv 2605.23904, 1.32k upvotes, Microsoft Research), Lens (2605.21573, 181, Microsoft), SciAtlas (2605.22878, 106, UCL), Rethinking Cross-Layer Information Routing in DiTs (2605.20708, 102, RTP-LLM), See What I Mean (2605.18018, 88, Alibaba Tongyi).

May 26, 2026 trending papers (huggingface.co/papers/date/2026-05-26): TriSplat (2605.26115, 186), DVAO (2605.25604, 128), WBench (2605.25874, 96), Macaron-A2UI (2605.24830, 77), Foundation Protocol (2605.23218, 75), Toward Native Multimodal Modeling (2605.25343, 38), Personalize-then-Store (2605.25535, 37), QUEST (2605.24218, 33), ParaVT (2605.20342, 33), ThriftAttention (2605.23081, 31), AutoResearch AI (2605.23204, 27), CUA-Gym (2605.25624, 26), Your Embedding Model is SMARTer Than You Think (2605.24938, 24), Claw-Anything (2605.26086, 21).

Open-source agent layer (github.com/m13v/fazm/commits/main): Fazm v2.9.37 on May 25 (agent guardrails for system-altering shell commands, third browser mode "No browser MCP, Assrt only"), Fazm v2.9.41 on May 26 (16 user-facing changes, including the verification-codes skill for SMS / 2FA codes from Messages SQLite, the cold-boot menu-bar fallback, the 10-minute inactivity cap removal, and the May 25 OAuth fix finally tagged). On May 26 at 11:33 PDT commit c304a3c3 deleted 116 of 120 lines from ClaudeAuthSheet.swift; at 19:59 PDT commit 08cf7332 replaced the "Connect Claude" peel with an inline model-picker entry.

The paper layer, by date

The Hugging Face dated papers index is the right primary source for "what got attention on day X". Counts on these pages shift slowly over weeks as new readers arrive, but ranks and arxiv IDs are stable. Both dates are linked at the top of each table.

May 25, 2026 (top 5)

| Title | arxiv | Upvotes | Org |
| --- | --- | --- | --- |
| SkillOpt: Executive Strategy for Self-Evolving Agent Skills | 2605.23904 | 1.32k | Microsoft Research |
| Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models | 2605.21573 | 181 | Microsoft |
| SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research | 2605.22878 | 106 | University College London CS |
| Rethinking Cross-Layer Information Routing in Diffusion Transformers | 2605.20708 | 102 | RTP-LLM |
| See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding | 2605.18018 | 88 | Alibaba Tongyi Lab |

SkillOpt is the headline. The +19.1 point gain inside Claude Code on GPT-5.5 is the most useful number on its chart because it isolates what an optimized skill-document buys inside a harness that already does its own tool-use scaffolding. The full discussion of why is in the May 24 to 25 page, linked at the bottom of this one. Lens, SciAtlas, the diffusion-transformer routing paper, and the Alibaba Tongyi video-fine-grained-object paper are the rest of the trainability-and-efficiency cluster on this date.

May 26, 2026 (full trending list)

| Title | arxiv | Upvotes |
| --- | --- | --- |
| TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction | 2605.26115 | 186 |
| DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning | 2605.25604 | 128 |
| WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation | 2605.25874 | 96 |
| Macaron-A2UI: A Model for Generative UI in Personal Agents | 2605.24830 | 77 |
| Foundation Protocol: A Coordination Layer for Agentic Society | 2605.23218 | 75 |
| Toward Native Multimodal Modeling: A Roadmap | 2605.25343 | 38 |
| Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents | 2605.25535 | 37 |
| QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks | 2605.24218 | 33 |
| ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video RL | 2605.20342 | 33 |
| ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention | 2605.23081 | 31 |
| AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery | 2605.23204 | 27 |
| CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents | 2605.25624 | 26 |
| Your Embedding Model is SMARTer Than You Think | 2605.24938 | 24 |
| Claw-Anything: Benchmarking Always-On Personal Assistants | 2605.26086 | 21 |

Three rows of this table point directly at the computer-use-agent category. Macaron-A2UI is a generative-UI model for personal agents, the right reference for any agent that has to render or restyle UI mid-conversation. CUA-Gym scales verifiable training environments and tasks for computer-use agents. Claw-Anything is a benchmark for always-on personal assistants. A roundup of this window that stops at the headline papers (TriSplat, DVAO, WBench) misses that on May 26 three independent groups published evaluation infrastructure for computer-use agents, the same day an open-source project in that category tagged a major release.

The agent layer: one open-source project, two tagged releases, 80+ commits

The implementation-layer slice of this 48-hour window is dense enough to be worth reading on its own. The open-source macOS agent Fazm tagged v2.9.37 on May 25 and v2.9.41 on May 26. Between them, 80+ commits landed on main. The most concrete way to see what happened is a per-hour walk of May 26.

2026-05-26 in five waves

  1. 10:39 PDT: MCP audit logging

    Commit a131ea38 adds per-tool execution audit logging to the MCP layer so any future agent-side failure carries the tool call that produced it.

  2. 11:33 PDT: standalone OAuth window deleted

    Commit c304a3c3 deletes 116 of 120 lines from ClaudeAuthSheet.swift. Claude OAuth now renders inline inside the PersonalAccountChooser used by Codex and ChatGPT.

  3. 13:34-15:37 PDT: anonymous sign-in wave

    30+ commits land the signin-optional experiment: anonymous Firebase sign-in via REST, AuthService anon state, paywall pre-checkout sign-in gate, magic-link orphan cleanup, and visitor_id capture from the clipboard for the download A/B test.

  4. 17:17 UTC: v2.9.41 tagged

    The May 26 release ships, packaging the May 25 OAuth fix, the verification-codes skill, the cold-boot menu-bar fallback, the 10-minute inactivity cap removal, and 12 other user-facing changes.

  5. 19:46-19:59 PDT: Connect Claude peel deleted

    A second wave at the end of the day removes the pulsing Connect Claude button entirely. Claude is now an inline entry in the model picker (Claude - Connect...), mirroring how Codex shows up when not connected.
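The anonymous sign-in wave in item 3 goes through Firebase's REST API. As a minimal sketch, assuming the documented Firebase Auth `accounts:signUp` endpoint (which creates an anonymous user when the body carries only `returnSecureToken`), the request could be built as below. The function names and the choice of which response fields to keep are illustrative assumptions, not Fazm's actual Swift code.

```python
import json
import urllib.parse
import urllib.request

# Firebase Auth REST sign-up endpoint. With no email or password in the
# body, the documented behavior is to create an anonymous user.
SIGN_UP_URL = "https://identitytoolkit.googleapis.com/v1/accounts:signUp"


def build_anonymous_signup_request(api_key):
    """Build (but do not send) an anonymous sign-up request.

    `api_key` is the project's Web API key; this helper name is hypothetical.
    """
    query = urllib.parse.urlencode({"key": api_key})
    body = json.dumps({"returnSecureToken": True}).encode("utf-8")
    return urllib.request.Request(
        url=f"{SIGN_UP_URL}?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_signup_response(raw):
    """Keep only the fields a client would typically persist (an assumption)."""
    payload = json.loads(raw)
    return {
        "uid": payload["localId"],
        "id_token": payload["idToken"],
        "refresh_token": payload["refreshToken"],
    }
```

Sending the request is left out on purpose; `urllib.request.urlopen(request)` would perform the call, and the response bytes can be passed to `parse_signup_response`.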

The structural commit of the day is the one at 11:33:09 PDT (c304a3c3). It deletes 116 of 120 lines from Desktop/Sources/Chat/ClaudeAuthSheet.swift. The standalone Claude OAuth window is gone after this commit; Claude OAuth renders inline inside the PersonalAccountChooser sheet that Codex and ChatGPT already used. The intent is exactly what the wave at 19:46 to 19:59 PDT completes: stop treating Claude as a special peel that demands its own pulsing button, and start treating it as one entry among many in the model picker, just like Codex and ChatGPT.

v2.9.41 in 16 user-facing lines

The release manifest for v2.9.41 lists 16 changes. The headline ones are below; the rest are smaller stability fixes (Claude account picker hang when expired credentials sit in the keychain, Memory settings page spinning forever on an empty graph, Gemini Pro picked by default on the credit-exhausted chooser).

  • Cold-boot menu-bar fix. Menu-bar and dock clicks after launch-at-login on reboot used to do nothing. The floating bar now appears as a fallback when the main window scene has not yet bound its open-window action.
  • OAuth fix tagged. The May 25 fix that removed the rejected expires_in field from the Anthropic token-exchange body ships in a tagged release here. Anthropic now controls token lifetime server-side for user:sessions:claude_code and user:mcp_servers.
  • Verification-codes skill. A new bundled skill that lets the agent pull a fresh one-time code from Messages.app SQLite, WhatsApp, or Notification Center instead of asking the user to read it out. The order of operations is below.
  • 10-minute inactivity cap removed. Slow models (Gemini Pro, GPT-5 high) used to fail with "AI took too long to respond" after 10 minutes of streaming silence. Cancellation is now user-initiated only.
  • Onboarding founder chat. A "Stuck? Chat with the founder" button on the onboarding sheet so new users can message the founder without finishing setup first.
  • Signin-optional plumbing. Anonymous Firebase sign-in on first launch when the signin-optional PostHog flag picks treatment. Goes through the REST signUp endpoint so it works with the rest of the auth flow.
  • Failed-query analytics. The chat_agent_query_failed event now carries the user prompt and any partial AI response. Failed queries are no longer invisible to support investigations.
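The OAuth fix in the list above amounts to dropping one key from the token-exchange body. A minimal sketch follows, assuming a generic OAuth 2.0 authorization-code exchange with PKCE. The field names are the standard RFC 6749 and RFC 7636 ones, not confirmed against Anthropic's endpoint, and the helper name is hypothetical.

```python
# Fields the article reports as rejected in the token-exchange body.
REJECTED_FIELDS = frozenset({"expires_in"})


def build_token_exchange_body(code, code_verifier, client_id, redirect_uri, extra=None):
    """Return the exchange parameters with any rejected field stripped.

    Token lifetime is left to the server, so a client-supplied
    `expires_in` is removed even if a caller passes it via `extra`.
    """
    params = {
        "grant_type": "authorization_code",
        "code": code,
        "code_verifier": code_verifier,
        "client_id": client_id,
        "redirect_uri": redirect_uri,
    }
    params.update(extra or {})
    return {key: value for key, value in params.items() if key not in REJECTED_FIELDS}
```

Filtering at the point where the body is built, rather than at each call site, keeps an older code path from reintroducing the field. Whether the body is then sent as JSON or form-encoded depends on the provider and is not specified here.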

Anchor fact: the verification-codes skill is 73 lines and reads Messages SQLite directly

Of the 16 user-facing changes in v2.9.41, the verification-codes skill is the one that most clearly demonstrates the gap that the May 26 papers (Claw-Anything in particular) are measuring. The skill file lives at Desktop/Sources/BundledSkills/verification-codes.skill.md in the Fazm repo and is 73 lines. The first concrete method it teaches the agent is a raw sqlite3 query against the user's local Messages database:

verification-codes.skill.md (Method 1: Messages.app via SQLite)

The reason this matters in the May 25 to 26 window is that it is a concrete example of capability no headless or cloud computer-use agent can offer: reading ~/Library/Messages/chat.db requires Full Disk Access granted to the local app process, and reading WhatsApp requires Accessibility granted to the local app process. Both of those are macOS TCC entitlements that you cannot grant to a remote AWS process pretending to be your assistant. The skill is also defensive. The date filter date/1000000000 + 978307200 > strftime('%s','now') - 600 (Mac epoch offset plus a 10-minute window) prevents stale codes from being surfaced. The skill also explicitly forbids speaking the code aloud via TTS, requires masking the middle digits if the code must be referenced in chat, and refuses to type the code anywhere the user did not ask.
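The freshness filter quoted above can be exercised without touching a real Messages database. The sketch below runs the same WHERE clause against an in-memory SQLite table. The two-column `message` table is a simplified stand-in (the real chat.db schema has many more columns), the sample codes are made up, and the helper names are hypothetical.

```python
import sqlite3
import time

# Seconds between the Unix epoch (1970-01-01) and the Mac epoch (2001-01-01), UTC.
MAC_EPOCH_OFFSET = 978307200

# Same predicate as quoted in the article: convert Mac-epoch nanoseconds to
# Unix seconds, then keep rows newer than now minus 600 seconds.
FRESH_SQL = """
SELECT text FROM message
WHERE date/1000000000 + 978307200 > strftime('%s','now') - 600
ORDER BY date DESC
"""


def to_mac_nanos(unix_seconds):
    """Convert a Unix timestamp in seconds to Mac-epoch nanoseconds."""
    return int((unix_seconds - MAC_EPOCH_OFFSET) * 1_000_000_000)


def demo_fresh_messages():
    """Insert one recent and one stale row, then apply the freshness filter."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE message (text TEXT, date INTEGER)")
    now = time.time()
    conn.executemany(
        "INSERT INTO message VALUES (?, ?)",
        [
            ("Your code is 482913", to_mac_nanos(now - 60)),    # 1 minute old
            ("Your code is 110022", to_mac_nanos(now - 3600)),  # 1 hour old
        ],
    )
    rows = conn.execute(FRESH_SQL).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Only the one-minute-old row survives the filter. The subtraction in `strftime('%s','now') - 600` makes SQLite treat the text timestamp as a number, so both sides of the comparison are integers.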

The order of operations across channels is below. The skill stops at the first hit unambiguously sent in the last 5 minutes.

verification-codes: order of operations

Participants: Agent, Messages SQLite, WhatsApp MCP, Notification Center.

  1. Agent to Messages SQLite: SELECT recent codes (chat.db). Fresh code in last 10 min?
  2. If miss, Agent to WhatsApp MCP: whatsapp_read_messages. Code in pinned chat?
  3. If miss, Agent to Notification Center: open and read banners. Code in recent notification?
  4. Stop at first hit; never echo to TTS.
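The order of operations above is a first-hit-wins fallback chain. A minimal sketch of that control flow, plus the middle-digit masking the skill asks for when a code is mentioned in chat, might look like this. The channel names, the reader callables, and the keep-one-digit masking policy are all illustrative assumptions rather than the skill's actual contents.

```python
from typing import Callable, Optional, Sequence, Tuple

# A reader returns a fresh code, or None when its channel has nothing recent.
Reader = Callable[[], Optional[str]]


def first_fresh_code(channels: Sequence[Tuple[str, Reader]]) -> Optional[Tuple[str, str]]:
    """Try each channel in order and stop at the first hit.

    Later channels are never queried once an earlier one returns a code.
    """
    for name, read in channels:
        code = read()
        if code:
            return name, code
    return None


def mask_middle(code: str, keep: int = 1) -> str:
    """Hide the middle characters so the full code never appears in chat text."""
    if len(code) <= 2 * keep:
        return "*" * len(code)
    return code[:keep] + "*" * (len(code) - 2 * keep) + code[-keep:]
```

In use, the channel list would be ordered Messages SQLite, then WhatsApp, then Notification Center, with each reader wrapping the corresponding lookup. `mask_middle("482913")` yields `4****3`.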

The Claw-Anything paper from May 26 (arxiv 2605.26086) benchmarks always-on personal assistants on exactly this kind of cross-app retrieval task. The benchmark target and the skill above are different angles on the same problem.

Paper-layer and agent-layer, in the same 48 hours

The point of reading these two layers in parallel is not to claim Fazm is the answer to the May 26 benchmark papers; it is not, and v2.9.41 was not scored on any of them. The point is the alignment of timing. Three of the May 26 trending papers target the personal-agent and computer-use category. One open-source project in that category tagged a major release on the same day. The release added a skill that operationalizes one of the capability primitives those papers measure (cross-app retrieval of ephemeral codes). The same release deleted a piece of UI that another personal-agent paper, Macaron-A2UI, argues should not be hand-coded at all (the standalone OAuth window).

A roundup that reads only the paper layer for this window comes away thinking the computer-use category is one or two years away from being shippable. A roundup that reads only the changelog layer for this window comes away thinking the implementation work is happening in isolation. The honest read is that the gap between the two is narrower than either layer reads on its own, and you need both feeds open to see it.

What this window predicts for the next two weeks

Three short, falsifiable predictions. Two carry explicit deadlines: 30 days for the first, 14 days for the third.

  • A second open-source Mac agent will inline its Claude OAuth flow into a unified model picker (the same shape Codex and ChatGPT already had) within 30 days. The May 26 refactor in Fazm is the cleanest worked example, and the structural argument (a per-provider modal window is a UI tax) generalizes.
  • At least one of the May 26 personal-agent benchmarks (CUA-Gym, Claw-Anything) gets a community contribution that scores a Mac-native always-on agent on the Messages SQLite + WhatsApp + Notification Center retrieval task. The skill above is concrete enough to map to a benchmark question.
  • Anthropic documents the May 25 OAuth-lifetime enforcement (still server-only as of May 28) in some form (API changelog, deprecation note, auth-docs update) within 14 days. The HTTP 400 response body is already self-documenting; the public docs catch up once the silent-enforcement pattern reaches a critical mass of broken clients.

Want a 15-minute walkthrough of the May 26 OAuth refactor on your own stack?

If you are running or evaluating a Claude Code wrapper and want to see the standalone-window-to-inline-picker refactor end to end, plus how the verification-codes skill is wired into the bundled skill loader, book a call. We open Fazm together, read the c304a3c3 diff, and walk the same discipline you would apply to your own client.

Frequently asked questions

What AI models, papers, and open-source projects shipped on May 25 to 26, 2026?

On the paper layer, May 25 was dominated by Microsoft Research's SkillOpt (arxiv 2605.23904, 1.32k Hugging Face upvotes), with Microsoft Lens (181), UCL's SciAtlas (106), Rethinking Cross-Layer Information Routing in DiTs (102), and Alibaba Tongyi's See What I Mean (88) rounding out the top five. May 26 shifted to agent-and-world-model benchmarks: TriSplat (186), DVAO (128), WBench (96), Macaron-A2UI (77), Foundation Protocol (75), with computer-use-agent papers Macaron-A2UI, CUA-Gym (26), and Claw-Anything (21) all trending on the same day. On the open-source agent layer, the macOS computer-use agent Fazm shipped v2.9.37 on May 25 (agent guardrails for system-altering shell commands, a third browser mode "No browser MCP, Assrt only") and v2.9.41 on May 26 (16 user-facing changes packaging 80+ commits, the May 25 OAuth fix, the verification-codes skill, and the cold-boot menu-bar fix). At 11:33 PDT on May 26, Fazm also deleted 116 of 120 lines from its standalone Claude OAuth window in a structural refactor that inlined Claude auth into the model picker.

Which papers from May 25 to 26 actually matter for someone building or using an AI agent on macOS?

Five papers in this window are directly load-bearing for anyone running a computer-use agent. SkillOpt (May 25) shows that a compact natural-language skill document, treated as the trainable state of a frozen agent, buys +19.1 points inside Claude Code on GPT-5.5 over no-skill baselines; this generalizes to any Claude Code wrapper, including Fazm. Macaron-A2UI (May 26) is a generative-UI model for personal agents and is the right reference for any agent that has to produce or restyle UI on the fly. Foundation Protocol (May 26) is a coordination-layer proposal for inter-agent communication. CUA-Gym (May 26) scales verifiable training environments and tasks for computer-use agents (the exact category Fazm sits in). Claw-Anything (May 26) benchmarks always-on personal assistants. If your agent is a real computer-use agent, three of those four May 26 papers are pointing at you.

Why did the Hugging Face daily papers index swing from training papers to agent papers between May 25 and May 26?

Two days, two centers of gravity. May 25's top of mind was efficiency for training (SkillOpt's skill-document optimizer, Lens's training-efficiency rework for text-to-image, the diffusion-transformer routing paper). The next day skewed toward agentic and world-model evaluation: TriSplat for simulation-ready 3D reconstruction (any embodied agent benchmark needs simulation-ready geometry), WBench for interactive video world models, plus three explicitly agent-shaped benchmarks (Macaron-A2UI, CUA-Gym, Claw-Anything). The simple read is that the May 25 batch was about lowering the cost of producing a frontier model, and the May 26 batch was about measuring what current frontier models actually do once you put them inside an agentic harness. Both kinds of work landed in the same 48 hours on the same dated index.

What did the open-source agent layer ship on May 25 to 26, 2026?

The agent project with the densest commit history in this window is Fazm (github.com/m13v/fazm). It tagged two releases: v2.9.37 on May 25 (the guardrails-and-no-browser-MCP release) and v2.9.41 on May 26 (the OAuth-fix-and-onboarding release). Between the two, 80+ commits landed on the main branch. The structural commit of the window is c304a3c3 on May 26 at 11:33:09 PDT, which deletes 116 of 120 lines from Desktop/Sources/Chat/ClaudeAuthSheet.swift. The standalone Claude OAuth window is gone; auth now renders inline in the same PersonalAccountChooser sheet that Codex and ChatGPT already used. A second wave at 19:46 to 19:59 PDT removes the pulsing 'Connect Claude' button entirely, replacing it with an inline entry in the model picker (the model dropdown now shows 'Claude - Connect...' the way it already showed 'Codex - Connect...' before). The whole 48-hour cycle is a worked example of platform-side change driving structural UX change in a downstream client.

What is the verification-codes skill that shipped in Fazm v2.9.41 and why is it worth singling out?

The verification-codes skill is a 73-line skill file at Desktop/Sources/BundledSkills/verification-codes.skill.md that teaches the agent to pull a one-time code (SMS, iMessage, 2FA, OTP) from local channels instead of asking the user to read it. The order of operations is Messages.app via raw SQLite against ~/Library/Messages/chat.db (with a 10-minute freshness window to avoid stale codes), then WhatsApp via the bundled whatsapp MCP, then macOS Notification Center via the macos-use MCP. The reason this is interesting in the May 25 to 26 window is that it is a concrete example of work no headless or cloud computer-use agent can do: reading the Messages SQLite database requires Full Disk Access granted to the local app, and reading WhatsApp requires Accessibility granted to a local app. It is the kind of capability that only ships when the agent is a real native macOS process with the right TCC entitlements, and the Claw-Anything paper from the same week is essentially benchmarking the absence of capabilities like this in current always-on assistants.

How can I verify the Hugging Face counts and Fazm commit timestamps in this article?

Every number on this page traces to a primary source. The Hugging Face dated indexes are at huggingface.co/papers/date/2026-05-25 and huggingface.co/papers/date/2026-05-26; upvote counts on dated indexes shift slowly over weeks, but the ranks and arxiv IDs are stable. The Fazm commit log is at github.com/m13v/fazm/commits/main; filter by date 2026-05-25 to 2026-05-26 and you will see the 80+ commits. The release manifests are in CHANGELOG.json at the repo root (search for 2.9.37, 2.9.41, 2.9.42 entries). The standalone OAuth window deletion is commit c304a3c3 and you can run git diff c304a3c3^ c304a3c3 -- Desktop/Sources/Chat/ClaudeAuthSheet.swift to read the diff directly. The verification-codes skill file is at Desktop/Sources/BundledSkills/verification-codes.skill.md in the same repo.

Is the Fazm v2.9.41 release the same product event as the May 25 OAuth break covered in the May 24 to 25 roundup?

Same break, different stage. The May 25 entry was the moment of discovery: two commits at 17:50:10 and 17:52:10 PDT on May 25 land the fix in Fazm's acp-bridge module, removing the rejected expires_in field from the OAuth token-exchange body. Those commits stayed un-tagged through the rest of May 25 and most of May 26. v2.9.41 on May 26 is the tagged release that finally ships them to end users, alongside 15 other unrelated user-facing changes that accumulated during the gap. The 24-hour delay between commit and tagged ship is itself a signal: an aggregator that only watches release feeds learned about the May 25 OAuth policy change a full day later than an aggregator that watches commit subjects in real time.

Where do I look for May 25 to 26, 2026 events that no roundup will capture?

Three places, in order of decreasing aggregator coverage. First, dated trending feeds (huggingface.co/papers/date/2026-05-25, /2026-05-26) plus the major lab news pages. This catches SkillOpt, TriSplat, DVAO, the WBench category, and the v2.9.37 and v2.9.41 tags. Second, the CHANGELOG.json (or equivalent release manifest) of one or two open-source agent projects you actually run. This catches the verification-codes skill, the cold-boot menu-bar fix, the 10-minute inactivity cap removal, and other concrete user-facing improvements that arrive in the body of a release rather than the headline. Third, the git commit log of those same projects, read by date range. This is where the structural refactors live (the May 26 11:33 PDT standalone-window deletion, the 19:46 to 19:59 PDT Connect-Claude peel removal) and where you can see platform-side policy enforcement reflected as same-day client fixes across multiple repos (the May 25 OAuth break is the clean example).

Does Fazm itself help me check 'what shipped in the past 48 hours' from inside the app?

Yes, through the deep-research skill that auto-installs to ~/.claude/skills/deep-research/SKILL.md on first launch. The skill runs an 8-phase pipeline (Scope, Plan, Retrieve, Triangulate, Synthesize, Critique, Refine, Package) with parallel web searches, parallel research subagents, citation verification where possible, and a markdown plus HTML plus PDF report into a dated folder under ~/Documents. Because it runs as your local Claude Code, Codex, or Gemini agent on your machine, the answer is grounded against fresh searches from your IP, not a cached newsletter summary. Combined with the verification-codes skill from v2.9.41 and the persistent-session architecture, the dated-window question is something you can answer on a Tuesday morning without leaving the app.
