New AI model releases, papers, and open source: June 3-4, 2026

Matthew Diakonov, Written with AI

Published June 16, 20269 min read

Two open-weight models landed across June 3 and June 4, 2026, and they sit at opposite ends of the size range. One is small enough to run on your laptop. The other is half a trillion parameters and was announced with a specific job in its own product copy: long-running agents. That second phrase is the one worth slowing down on, because a model built for agents that run for hours is only as long-running as the thing that holds its context.

Direct answer, verified June 16, 2026

June 3-4, 2026 brought two squarely open releases: Gemma 4 12B (Google DeepMind, June 3, Apache 2.0, an encoder-free multimodal model that runs on a 16 GB laptop) and NVIDIA Nemotron 3 Ultra (June 4, OpenMDW-1.1, a 550B total / 55B active hybrid Mamba-Transformer Mixture-of-Experts built for long-running agents). Both ship downloadable weights on Hugging Face.

Date	Model	From	Weights	License
June 3	Gemma 4 12B Encoder-free multimodal (text/image/audio/video), runs on a 16 GB laptop	Google DeepMind	Open	Apache 2.0
June 4	Nemotron 3 Ultra 550B total / 55B active hybrid Mamba-Transformer MoE, built for long-running agents	NVIDIA	Open	OpenMDW-1.1

Sources: Gemma 4 12B launch coverage and Nemotron 3 Ultra launch coverage.

The June 4 release names its own bottleneck

NVIDIA did not market Nemotron 3 Ultra as a chat model. The launch framing is precise: an open Mixture-of-Experts hybrid Mamba-Transformer for agentic reasoning, tool use, and long-context tasks, aimed at “long-running agents that plan, call tools, and reason across many turns.” The hybrid Mamba-Attention stack and the 550B/55B sparsity are there to make a model that stays coherent over a long run cheaper to serve.

Here is the part the spec sheet cannot fix. A model trained to reason across many turns has to actually be handed those turns at inference time. If the thing running the loop trims older messages to save tokens, or forgets the whole conversation when the app closes, the model never receives the long history it was built to use. The weights got better on June 4. Whether you feel it is decided one layer up, in the harness.

Where a long run actually breaks

Picture a single agent run of a dozen turns: read a file, call a tool, make a decision, keep going. The model is fine. The failure is structural, and it shows up at the turn where the harness decides the history is too long and quietly drops the earliest decision.

A long run, and the turn where context goes missing

That last step is the whole problem with running a long-horizon model through a harness that compacts. The model did not regress. It was handed an incomplete transcript and answered the question it was actually asked. A restart mid-run is the same failure with a harsher edge: the session is gone, not just trimmed.

The one setting that lets the same session hold an open model

Fazm is the harness, not the model. It wraps Claude Code and Codex through the Agent Client Protocol in a native macOS UI, and it is built so the two failure modes above do not happen: sessions survive a Mac restart and auto-restore with full history, and nothing auto-compacts, so the entire conversation stays live in context for the lifetime of the window. That is exactly the property a model “for long-running agents” needs from whatever runs it.

The concrete, checkable part is how you point that persistent session at a June 3-4 open model instead of a hosted one. There is a single field, customApiEndpoint, in SettingsPage.swift. When the value is a valid URL, the bridge sets one environment variable for the agent process:

ACPBridge.swift

Because the override is just ANTHROPIC_BASE_URL, anything that speaks the Anthropic API format works behind it: a local bridge serving Gemma 4 12B on your 16 GB laptop, a self-hosted Nemotron 3 Ultra endpoint, a corporate proxy, or a Copilot bridge. A raw OpenAI or Gemini key does not, by design, since the bridge expects the Anthropic request shape. The endpoint applies to Claude models, so you switch the backend without changing how you work.

Open weights in, a session that does not forget out

Why this maps cleanly onto these two releases

The two June 3-4 models pull in opposite directions, and the harness is what makes either one usable in a real workflow rather than a benchmark screenshot.

Gemma 4 12B is the local one. It is designed to run on roughly 16 GB of unified memory and integrates with MLX and llama.cpp, which are exactly the bridges you would stand up on a Mac. The endpoint override means you can run a fully local, private agent loop: your screen and mic stay on your machine, the model stays on your machine, and the session that drives it stays intact across restarts. Reaching past the terminal still works, because the agent drives your browser and native Mac apps through accessibility APIs rather than a code-only sandbox.

Nemotron 3 Ultra is the long one. A 550B model you would self-host or hit through a hosted endpoint, built to reason across many turns. Point the persistent session at it and the run can go long without the harness eating the early decisions. If you want to compare it against your usual model on the same task, fork the chat with one click: a new window opens with the full prior context and the original is left untouched, so the two runs start from an identical history instead of a re-typed prompt.

Run a same-week open model through a session that does not forget

Twenty minutes on pointing Fazm at a local or self-hosted endpoint, with persistence and no auto-compaction in the loop.

Questions people actually ask about this window

What new AI models released on June 3-4, 2026?

Two open-weight models at opposite ends of the size spectrum. On June 3 Google DeepMind released Gemma 4 12B, an encoder-free multimodal model (text, image, audio, video) under Apache 2.0 that runs locally on a 16 GB laptop. On June 4 NVIDIA released Nemotron 3 Ultra, a fully open 550-billion-parameter Mixture-of-Experts hybrid Mamba-Transformer with 55 billion active parameters per token, under the OpenMDW-1.1 license, with weights on Hugging Face. The continuous open-weight and preprint stream kept moving on both days, so the only fully current view is the live feeds, not a static list.

Is Nemotron 3 Ultra open source, and what is it built for?

Yes. NVIDIA released base, post-trained, and NVFP4 checkpoints openly under OpenMDW-1.1, with weights at nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 on Hugging Face. NVIDIA describes it as an open Mixture-of-Experts hybrid Mamba-Transformer for agentic reasoning, tool use, and long-context tasks, and is explicit about the target: 'long-running agents that plan, call tools, and reason across many turns.' That phrase is the whole story for anyone running an agent on their own machine.

Can Gemma 4 12B run on my laptop, and how does that fit a Mac agent?

Yes. Gemma 4 12B is a roughly 12-billion-parameter dense model designed to run on about 16 GB of VRAM or unified memory, with weights on Hugging Face and Kaggle and support for vLLM, SGLang, MLX, and llama.cpp. On a Mac that means you can serve it locally and point a Claude-compatible bridge at it. The bridge speaks the Anthropic API format, and Fazm's custom endpoint setting overrides ANTHROPIC_BASE_URL so the same persistent session talks to your local Gemma instead of a hosted model.

Why does 'built for long-running agents' depend on the harness, not the weights?

A model that can reason across many turns still needs every prior turn to be in context when it gets there. If your agent harness silently compacts older turns to save tokens, or loses the whole session when the app restarts, the model never sees the long history it was trained to use. Nemotron 3 Ultra changed the model layer on June 4. Whether you get long-running agency out of it depends on whether your harness keeps the session alive and the context intact across the run.

How does Fazm point a persistent session at one of these open models?

Fazm has a custom API endpoint setting (the customApiEndpoint preference in SettingsPage.swift). When you enter a valid URL, the ACP bridge sets env["ANTHROPIC_BASE_URL"] to that endpoint (ACPBridge.swift), so Claude-format requests route to any Anthropic-API-compatible bridge: a local LLM bridge serving Gemma 4 12B, a self-hosted Nemotron endpoint, a corporate proxy, or a GitHub Copilot bridge. Fazm does not send its built-in Anthropic key for those requests and does not count them against built-in credits.

Do I lose my chat if I restart while testing a same-week model?

No. Fazm persists sessions: chats survive a Mac restart and every window auto-restores with full conversation history. Nothing auto-compacts, so the full chat history stays live in context for the lifetime of the window. That matters when you are running a multi-day trial of a fresh open model, because the comparison does not quietly drift just because the harness dropped earlier turns overnight.

Were there other notable open releases that same week?

Yes, the open-weight wave around early June was broad. NVIDIA also launched Cosmos 3, an open foundation model for physical AI, on June 1, and Microsoft announced its closed MAI models at Build on June 2. For June 3-4 specifically, the two squarely open, generally-usable language and multimodal releases were Gemma 4 12B and Nemotron 3 Ultra. Dated windows are noisy, so treat any static list as a snapshot and check the live feeds for the current state.

Keep reading

Roundup

June 2-3, 2026: two closed models, one open, and the layer that decides if any help

The previous window: Microsoft's MAI models and Google's Gemma 4 12B, read through the harness that feeds them screen context.

8 minRead

Roundup

June 10-11, 2026 AI model releases, papers, and open source

The next dated window in the same series.

8 minRead

How it works

Routing Claude Code through a custom ANTHROPIC_BASE_URL endpoint

How the Anthropic base URL override works, and why it lets the same workflow point at a local or proxied model.

6 minRead

Deep dive

When Claude Code compacting quietly drops a decision

Why auto-compaction is the failure mode that breaks long-running agents, and what keeping full history live costs.

7 minRead

Fazm is open source at github.com/mediar-ai/fazm. Dated release windows are noisy; treat any static list as a snapshot and check the live model feeds for the current state.