New AI model releases, papers, and open source: June 3-4, 2026

Matthew Diakonov

Two open-weight models landed across June 3 and June 4, 2026, and they sit at opposite ends of the size range. One is small enough to run on a laptop. The other has roughly half a trillion parameters (550B) and was announced with a specific job in its own product copy: long-running agents. That phrase is the one worth slowing down on, because a model built for agents that run for hours is only as long-running as the thing that holds its context.

Direct answer, verified June 16, 2026

June 3-4, 2026 brought two squarely open releases: Gemma 4 12B (Google DeepMind, June 3, Apache 2.0, an encoder-free multimodal model that runs on a 16 GB laptop) and NVIDIA Nemotron 3 Ultra (June 4, OpenMDW-1.1, a 550B total / 55B active hybrid Mamba-Transformer Mixture-of-Experts built for long-running agents). Both ship downloadable weights on Hugging Face.

| Date | Model | From | Weights | License |
| --- | --- | --- | --- | --- |
| June 3 | Gemma 4 12B: encoder-free multimodal (text/image/audio/video), runs on a 16 GB laptop | Google DeepMind | Open | Apache 2.0 |
| June 4 | Nemotron 3 Ultra: 550B total / 55B active hybrid Mamba-Transformer MoE, built for long-running agents | NVIDIA | Open | OpenMDW-1.1 |

Sources: Gemma 4 12B launch coverage and Nemotron 3 Ultra launch coverage.

The June 4 release names its own bottleneck

NVIDIA did not market Nemotron 3 Ultra as a chat model. The launch framing is precise: an open Mixture-of-Experts hybrid Mamba-Transformer for agentic reasoning, tool use, and long-context tasks, aimed at “long-running agents that plan, call tools, and reason across many turns.” The hybrid Mamba-Attention stack and the 550B/55B sparsity are there to make a model that stays coherent over a long run cheaper to serve.

Here is the part the spec sheet cannot fix. A model trained to reason across many turns has to actually be handed those turns at inference time. If the thing running the loop trims older messages to save tokens, or forgets the whole conversation when the app closes, the model never receives the long history it was built to use. The weights got better on June 4. Whether you feel it is decided one layer up, in the harness.

Where a long run actually breaks

Picture a single agent run of a dozen turns: read a file, call a tool, make a decision, keep going. The model is fine. The failure is structural, and it shows up at the turn where the harness decides the history is too long and quietly drops the earliest decision.

A long run, and the turn where context goes missing

Participants: You, Harness, Model.

1. Turn 1: decide architecture
2. Full history + turn 1
3. Plan recorded
4. Turns 2-9: tools, edits
5. Auto-compact: drop turn 1
6. Turn 10 without the decision
7. Re-derives, contradicts turn 1

That last step is the whole problem with running a long-horizon model through a harness that compacts. The model did not regress. It was handed an incomplete transcript and answered the question it was actually asked. A restart mid-run is the same failure with a harsher edge: the session is gone, not just trimmed.
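The failure described above can be sketched in a few lines. This is a minimal illustration, assuming a naive "drop the oldest turns until the history fits" policy; the `Turn` type, the `compact` function, and the token counts are all hypothetical and do not describe any particular harness's algorithm.

```swift
// A hypothetical transcript turn; token counts are made up for illustration.
struct Turn {
    let index: Int
    let summary: String
    let tokens: Int
}

/// Naive auto-compaction: drop the oldest turns until the history fits the budget.
/// Illustrative policy only, not any specific harness's implementation.
func compact(_ turns: [Turn], tokenBudget: Int) -> [Turn] {
    var kept = turns
    var total = kept.reduce(0) { $0 + $1.tokens }
    while total > tokenBudget, let oldest = kept.first {
        total -= oldest.tokens
        kept.removeFirst()
    }
    return kept
}

// Ten turns of 1,000 tokens each; turn 1 carries the architecture decision.
let transcript = (1...10).map { i in
    Turn(index: i, summary: i == 1 ? "decide architecture" : "tools, edits", tokens: 1_000)
}

let compacted = compact(transcript, tokenBudget: 9_000)
print(compacted.first?.index ?? -1)                 // earliest surviving turn
print(compacted.contains(where: { $0.index == 1 })) // did the turn-1 decision survive?
```

With a 9,000-token budget the only turn dropped is turn 1, which is exactly the turn that recorded the decision. The model answering turn 10 never sees it.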

The one setting that lets the same session hold an open model

Fazm is the harness, not the model. It wraps Claude Code and Codex through the Agent Client Protocol in a native macOS UI, and it is built so the two failure modes above do not happen: sessions survive a Mac restart and auto-restore with full history, and nothing auto-compacts, so the entire conversation stays live in context for the lifetime of the window. That is exactly the property a model “for long-running agents” needs from whatever runs it.

The concrete, checkable part is how you point that persistent session at a June 3-4 open model instead of a hosted one. There is a single field, customApiEndpoint, in SettingsPage.swift. When the value is a valid URL, the bridge (ACPBridge.swift) sets one environment variable for the agent process: ANTHROPIC_BASE_URL.

Because the override is just ANTHROPIC_BASE_URL, anything that speaks the Anthropic API format works behind it: a local bridge serving Gemma 4 12B on your 16 GB laptop, a self-hosted Nemotron 3 Ultra endpoint, a corporate proxy, or a Copilot bridge. A raw OpenAI or Gemini key does not, by design, since the bridge expects the Anthropic request shape. The endpoint applies to Claude models, so you switch the backend without changing how you work.
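As a rough sketch of that override, and not Fazm's actual source, the logic could look like the following. The function name `agentEnvironment` and the validation rule (an http or https URL with a host) are assumptions made for illustration; only the `customApiEndpoint` setting and the `ANTHROPIC_BASE_URL` variable come from the article.

```swift
import Foundation

/// Returns a copy of `base` with ANTHROPIC_BASE_URL set when `customApiEndpoint`
/// parses as an http(s) URL with a host; otherwise returns `base` unchanged.
/// The validation rule is an assumption, not Fazm's actual check.
func agentEnvironment(base: [String: String], customApiEndpoint: String?) -> [String: String] {
    var env = base
    guard
        let raw = customApiEndpoint?.trimmingCharacters(in: .whitespacesAndNewlines),
        !raw.isEmpty,
        let components = URLComponents(string: raw),
        let scheme = components.scheme?.lowercased(),
        scheme == "http" || scheme == "https",
        let host = components.host, !host.isEmpty
    else {
        return env // no override: the agent keeps its default backend
    }
    env["ANTHROPIC_BASE_URL"] = raw
    return env
}

// Hypothetical local bridge address, used only for illustration.
let overridden = agentEnvironment(base: [:], customApiEndpoint: "http://localhost:8080")
let untouched = agentEnvironment(base: [:], customApiEndpoint: "not a url")
print(overridden["ANTHROPIC_BASE_URL"] ?? "unset")
print(untouched["ANTHROPIC_BASE_URL"] ?? "unset")
```

A launcher could then hand the resulting dictionary to the child process, for example through the `environment` property of Foundation's `Process`, so the override affects only the agent and not the rest of the app.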

Open weights in, a session that does not forget out

In: Gemma 4 12B, Nemotron 3 Ultra, Claude Pro / Max
Through: Fazm
Out: Survives restart, No auto-compaction, One-click fork

Why this maps cleanly onto these two releases

The two June 3-4 models pull in opposite directions, and the harness is what makes either one usable in a real workflow rather than a benchmark screenshot.

Gemma 4 12B is the local one. It is designed to run on roughly 16 GB of unified memory and integrates with MLX and llama.cpp, which are exactly the bridges you would stand up on a Mac. The endpoint override means you can run a fully local, private agent loop: your screen and mic stay on your machine, the model stays on your machine, and the session that drives it stays intact across restarts. Reaching past the terminal still works, because the agent drives your browser and native Mac apps through accessibility APIs rather than a code-only sandbox.
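A back-of-the-envelope estimate helps put the 16 GB figure in context. The sketch below counts weight storage only and ignores KV cache, activations, and runtime overhead, so real usage is higher. The precisions listed are assumptions for illustration; the article does not say which precision the laptop target uses.

```swift
/// Approximate weight memory in gigabytes (1 GB = 1e9 bytes) for a dense model.
/// Ignores KV cache, activations, and runtime overhead.
func weightGigabytes(parametersInBillions: Double, bitsPerWeight: Double) -> Double {
    // (billions of params * 1e9) * (bits / 8) bytes, divided by 1e9 bytes per GB
    return parametersInBillions * bitsPerWeight / 8.0
}

for bits in [16.0, 8.0, 4.0] {
    let gb = weightGigabytes(parametersInBillions: 12, bitsPerWeight: bits)
    print("12B at \(Int(bits))-bit: ~\(gb) GB of weights")
}
```

By this arithmetic, 16-bit weights for a 12B dense model come to about 24 GB, so a run on 16 GB of unified memory presumably relies on a lower-precision build, which the MLX and llama.cpp ecosystems commonly provide.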

Nemotron 3 Ultra is the long one: a 550B model you would self-host or reach through a hosted endpoint, built to reason across many turns. Point the persistent session at it and the run can go long without the harness eating the early decisions. If you want to compare it against your usual model on the same task, fork the chat with one click: a new window opens with the full prior context and the original is left untouched, so the two runs start from an identical history instead of a re-typed prompt.
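The fork behavior described here maps naturally onto value semantics. The following is a minimal sketch with hypothetical `Message` and `Session` types and placeholder model labels; Fazm's real session model is not shown in the article.

```swift
// Hypothetical session model, for illustration only.
struct Message {
    let role: String
    let text: String
}

struct Session {
    var model: String
    var history: [Message]

    /// Returns an independent copy that starts from the same history.
    /// Swift arrays are value types, so later appends do not leak between copies.
    func fork(model newModel: String) -> Session {
        return Session(model: newModel, history: history)
    }
}

let original = Session(model: "usual-model", history: [
    Message(role: "user", text: "Decide the architecture."),
    Message(role: "assistant", text: "Plan recorded.")
])
var forked = original.fork(model: "nemotron-3-ultra")

// Continue only the fork; the original window keeps its two messages.
forked.history.append(Message(role: "user", text: "Now implement step 1."))
print(original.history.count) // 2
print(forked.history.count)   // 3
```

Because both sessions share the first two messages verbatim, any difference in what follows comes from the model, not from a re-typed prompt.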


Questions people actually ask about this window

What new AI models released on June 3-4, 2026?

Two open-weight models at opposite ends of the size spectrum. On June 3 Google DeepMind released Gemma 4 12B, an encoder-free multimodal model (text, image, audio, video) under Apache 2.0 that runs locally on a 16 GB laptop. On June 4 NVIDIA released Nemotron 3 Ultra, a fully open 550-billion-parameter Mixture-of-Experts hybrid Mamba-Transformer with 55 billion active parameters per token, under the OpenMDW-1.1 license, with weights on Hugging Face. The continuous open-weight and preprint stream kept moving on both days, so the only fully current view is the live feeds, not a static list.

Is Nemotron 3 Ultra open source, and what is it built for?

Yes. NVIDIA released base, post-trained, and NVFP4 checkpoints openly under OpenMDW-1.1, with weights at nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 on Hugging Face. NVIDIA describes it as an open Mixture-of-Experts hybrid Mamba-Transformer for agentic reasoning, tool use, and long-context tasks, and is explicit about the target: “long-running agents that plan, call tools, and reason across many turns.” That phrase is the whole story for anyone running an agent on their own machine.

Can Gemma 4 12B run on my laptop, and how does that fit a Mac agent?

Yes. Gemma 4 12B is a roughly 12-billion-parameter dense model designed to run on about 16 GB of VRAM or unified memory, with weights on Hugging Face and Kaggle and support for vLLM, SGLang, MLX, and llama.cpp. On a Mac that means you can serve it locally and point a Claude-compatible bridge at it. The bridge speaks the Anthropic API format, and Fazm's custom endpoint setting overrides ANTHROPIC_BASE_URL so the same persistent session talks to your local Gemma instead of a hosted model.
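To make "speaks the Anthropic API format" concrete, here is a sketch that builds, but does not send, a Messages-style request against a base URL. The `/v1/messages` path and the `x-api-key` and `anthropic-version` headers follow Anthropic's public Messages API. The localhost address, the `local-gemma` model id, and the placeholder key are hypothetical, and whether a given local bridge checks the key at all is an assumption.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking // URLRequest lives here on Linux
#endif

struct ChatMessage: Codable {
    let role: String
    let content: String
}

struct MessagesRequest: Codable {
    let model: String
    let max_tokens: Int
    let messages: [ChatMessage]
}

/// Builds (but does not send) a Messages-style POST against `baseURL`.
func makeMessagesRequest(baseURL: URL, apiKey: String, body: MessagesRequest) throws -> URLRequest {
    var request = URLRequest(url: baseURL.appendingPathComponent("v1/messages"))
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "content-type")
    request.setValue(apiKey, forHTTPHeaderField: "x-api-key")
    request.setValue("2023-06-01", forHTTPHeaderField: "anthropic-version")
    request.httpBody = try JSONEncoder().encode(body)
    return request
}

// Hypothetical values throughout: address, model id, and key are placeholders.
let payload = MessagesRequest(
    model: "local-gemma",
    max_tokens: 256,
    messages: [ChatMessage(role: "user", content: "Hello")]
)
let messagesRequest = try makeMessagesRequest(
    baseURL: URL(string: "http://localhost:8080")!,
    apiKey: "placeholder",
    body: payload
)
print(messagesRequest.url?.absoluteString ?? "no url")
```

A bridge that accepts this shape and translates it for a local runtime is what sits behind the overridden base URL; a server that only understands another vendor's request format would reject it.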

Why does 'built for long-running agents' depend on the harness, not the weights?

A model that can reason across many turns still needs every prior turn to be in context when it gets there. If your agent harness silently compacts older turns to save tokens, or loses the whole session when the app restarts, the model never sees the long history it was trained to use. Nemotron 3 Ultra changed the model layer on June 4. Whether you get long-running agency out of it depends on whether your harness keeps the session alive and the context intact across the run.

How does Fazm point a persistent session at one of these open models?

Fazm has a custom API endpoint setting (the customApiEndpoint preference in SettingsPage.swift). When you enter a valid URL, the ACP bridge sets env["ANTHROPIC_BASE_URL"] to that endpoint (ACPBridge.swift), so Claude-format requests route to any Anthropic-API-compatible bridge: a local LLM bridge serving Gemma 4 12B, a self-hosted Nemotron endpoint, a corporate proxy, or a GitHub Copilot bridge. Fazm does not send its built-in Anthropic key for those requests and does not count them against built-in credits.

Do I lose my chat if I restart while testing a same-week model?

No. Fazm persists sessions: chats survive a Mac restart and every window auto-restores with full conversation history. Nothing auto-compacts, so the full chat history stays live in context for the lifetime of the window. That matters when you are running a multi-day trial of a fresh open model, because the comparison does not quietly drift just because the harness dropped earlier turns overnight.
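One simple way to get this kind of persistence is to write the full history to disk and read it back on launch. The sketch below assumes a JSON file in the temporary directory and hypothetical `StoredSession` and `StoredMessage` types; how Fazm actually stores sessions is not described in the article.

```swift
import Foundation

struct StoredMessage: Codable, Equatable {
    let role: String
    let text: String
}

struct StoredSession: Codable, Equatable {
    let id: String
    var history: [StoredMessage]
}

/// Writes the whole session, with every turn, to `url`.
func save(_ session: StoredSession, to url: URL) throws {
    let data = try JSONEncoder().encode(session)
    try data.write(to: url, options: .atomic)
}

/// Reads a session back; nothing is trimmed or summarized on the way.
func restore(from url: URL) throws -> StoredSession {
    let data = try Data(contentsOf: url)
    return try JSONDecoder().decode(StoredSession.self, from: data)
}

// Hypothetical file location, used only for this demo.
let sessionFile = FileManager.default.temporaryDirectory
    .appendingPathComponent("session-demo.json")
let beforeRestart = StoredSession(id: "demo", history: [
    StoredMessage(role: "user", text: "Decide the architecture."),
    StoredMessage(role: "assistant", text: "Plan recorded.")
])
try save(beforeRestart, to: sessionFile)
let afterRestart = try restore(from: sessionFile)
print(afterRestart == beforeRestart)
```

The point of the round trip is that the restored history is equal to the saved one, turn for turn, so a multi-day comparison keeps running against the same transcript.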

Were there other notable open releases that same week?

Yes, the open-weight wave around early June was broad. NVIDIA also launched Cosmos 3, an open foundation model for physical AI, on June 1, and Microsoft announced its closed MAI models at Build on June 2. For June 3-4 specifically, the two squarely open, generally-usable language and multimodal releases were Gemma 4 12B and Nemotron 3 Ultra. Dated windows are noisy, so treat any static list as a snapshot and check the live feeds for the current state.

Fazm is open source at github.com/mediar-ai/fazm.
