New AI model releases, papers, and open source: June 6-7, 2026
The June 6-7 weekend itself did not bring a net-new flagship model. What it brought was the moment a model launched a few days earlier turned into something you could actually pull down and run: MiniMax M3, an open-weight model with a 1-million-token context window and computer-use built into its weights, with downloadable checkpoints and community quantizations spreading across the early-June days. That is the release worth slowing down on, because the headline that sells it (one million tokens of context) is the exact spec your agent harness is most likely to quietly take away from you.
June 6-7, 2026 was light on net-new flagship launches. The live open-weight story was MiniMax M3, which MiniMax released on June 1, 2026, and whose open weights and community GGUF quantizations rolled out across the following days so it could be run locally. It is a Mixture-of-Experts model under the minimax-community license.
Sources: M3 launch coverage (June 1), official model card, and community GGUF weights. Dated windows are noisy; check the live model feeds for the current state.
The headline spec is the one your harness is most likely to eat
M3 leads with a 1-million-token context window. On the model card that is a real number: the weights can attend across that much input. But a window is a ceiling, not a delivery guarantee. The amount of history the model actually receives at any given turn is decided by whatever drives the loop, and most agent runners are built to keep the token bill down by compacting older turns once the conversation gets long.
So you can download a model with a million-token window and still hand it a heavily-trimmed transcript at turn 40, because the harness decided turns 1 through 20 were old news. The model did not regress. It answered the question it was actually asked, with the context it was actually given. The window you paid for is sitting unused above the line the harness drew.
1M tokens on the model card vs. what the loop actually feeds it
A 1M-context model fed a trimmed transcript. Early turns are summarized away to save tokens, and a restart drops the session entirely.
- Model can attend to 1M tokens
- Harness sends a compacted slice
- Early decisions silently gone by turn 40
- App restart loses the whole run
What M3 actually asks of the thing running it
Read M3's feature list as a set of requirements it places on the harness, not just bragging rights. Each property the model ships is only usable end to end if the layer above it cooperates.
1M-token context
M3's headline number. It only matters end to end if the thing running the loop never trims older turns to save tokens. Fazm does not auto-compact, so the window stays full for the window's lifetime.
Native computer use
M3 ships desktop computer-use capability in its weights. Fazm supplies the actual surface: it drives your real browser and native Mac apps through accessibility APIs, not screenshots.
Open weights + GGUF
MiniMaxAI/MiniMax-M3 plus unsloth GGUFs (1-bit to BF16) for llama.cpp, Ollama, and LM Studio. Serve it locally, then route Fazm's session at that local endpoint.
Agentic coding focus
M3 is built for multi-turn tool use and coding agents. A coding agent that runs for an hour needs its early decisions still in context at turn 40. That is a persistence and no-compaction property, supplied by the harness.
The one setting that points a persistent session at your local M3
Fazm is the harness, not the model. It wraps Claude Code and Codex through the Agent Client Protocol in a native macOS UI, and it is built so the two failure modes above do not happen: sessions survive a Mac restart and auto-restore with full history, and nothing auto-compacts, so the entire conversation stays live in context for the lifetime of the window.
The concrete, checkable part is how you make that persistent session talk to an M3 you are serving yourself instead of a hosted model. There is a single field, customApiEndpoint, in SettingsPage.swift. When the value is a valid absolute http(s) URL, the bridge sets one environment variable for the agent process and disables the bundled key, so your weights and your key never leak to the wrong place:
Because the override is just ANTHROPIC_BASE_URL, anything that speaks the Anthropic request shape works behind it: a local bridge serving an M3 GGUF on your Mac, a self-hosted M3 endpoint, a corporate gateway, or a Copilot bridge. A raw OpenAI or Gemini key does not, by design, because the bridge expects the Anthropic format. And a malformed value (a bare localhost:8766 with no scheme, say) is rejected rather than silently dropped into the base URL, which would otherwise brick chat with an “Invalid URL” error on every turn.
Running M3 locally on a Mac was not a one-click weekend
Worth a caveat, because the roundups skip it. The official weights live at MiniMaxAI/MiniMax-M3 with SGLang, vLLM, and Transformers recipes, and the unsloth GGUF repo carries quantizations from 1-bit through BF16 for llama.cpp, Ollama, and LM Studio. But early in the window, llama.cpp support for M3 was preliminary and not yet in a released build, so a genuine local run meant compiling from a specific pull request, not just pulling a binary. Once you have a local endpoint up, the routing above is the same regardless of which runner you used to stand it up.
And the reach does not stop at the terminal. Whatever model you put behind the endpoint, Fazm's agent drives your actual browser and native Mac apps through accessibility APIs rather than screenshots, so an M3 you are serving locally can act across your desktop while its session stays intact across restarts and never gets compacted out from under it.
Run a same-week open model through a session that does not forget
Twenty minutes on pointing Fazm at a local or self-hosted M3 endpoint, with persistence and no auto-compaction in the loop.
Questions people actually ask about this window
What new AI models released on June 6-7, 2026?
That specific weekend was quiet for net-new flagship launches. The live open-weight story across the window was MiniMax M3, which MiniMax released on June 1, 2026, and whose downloadable weights plus community quantizations rolled out across the following days on Hugging Face. M3 is a roughly 428-billion-parameter Mixture-of-Experts model with about 23 billion parameters activated per token, a 1-million-token context window, native multimodal and desktop computer-use capability, and an agentic-coding focus, under the minimax-community license. Dated windows are noisy, so treat any static list as a snapshot and check the live model feeds for the current state.
Where can I actually download and run MiniMax M3?
The official weights are at MiniMaxAI/MiniMax-M3 on Hugging Face, with SGLang, vLLM, and Transformers recipes. For local runs, the unsloth/MiniMax-M3-GGUF repository hosts quantizations from 1-bit through BF16 that target llama.cpp, Ollama, and LM Studio. Note that llama.cpp support for M3 was preliminary at release and required building from a specific pull request rather than a tagged release, so a Mac local run early in the window meant compiling, not just pulling a binary.
Why does a 1-million-token context window depend on the harness, not the model?
A 1M-token window is a ceiling on what the model can attend to, not a guarantee that your tool feeds it that much. Most agent harnesses compact older turns once the running history gets long, to save tokens. If the harness compacts, the model never receives the long history the 1M window was built to hold, so you paid for context you do not actually get to use. The window is real; whether you feel it is decided one layer up, in whatever runs the loop.
How does Fazm point a persistent session at a locally-served MiniMax M3?
Fazm has a single custom-endpoint field, customApiEndpoint, in SettingsPage.swift. When the value is a valid absolute http(s) URL, ACPBridge.swift sets env["ANTHROPIC_BASE_URL"] to it, sets FAZM_CUSTOM_API_ENDPOINT to true, and swaps in a placeholder API key so Fazm's bundled Anthropic key is never sent to your proxy (ACPBridge.swift lines 2423-2433). So anything that speaks the Anthropic request shape works behind it: a local bridge serving an M3 GGUF, a self-hosted M3 endpoint, a corporate gateway, or a Copilot bridge. A malformed value is rejected and the bridge falls back to the default, rather than silently bricking chat.
If M3 is multimodal and does computer use, does that overlap with what Fazm does?
They sit at different layers. M3 is a model with computer-use capability in its weights. Fazm is the harness: it wraps Claude Code and Codex through the Agent Client Protocol in a native macOS UI, and reaches past the terminal by driving your real browser and native Mac apps through accessibility APIs rather than screenshots. You can point Fazm's loop at M3 (or any Anthropic-compatible endpoint) as the backend, and Fazm supplies the persistence, the non-compacting context, and the Mac-wide reach around it.
Do I lose my chat if I restart while testing a same-week model?
No. Fazm persists sessions: chats survive a Mac restart and every window auto-restores with full conversation history. Nothing auto-compacts, so the entire conversation stays live in context for the lifetime of the window. That matters during a multi-day trial of a fresh open model like M3, because the comparison does not quietly drift just because the harness dropped earlier turns overnight.
Were there other notable open releases around early June 2026?
The early-June open-weight wave was broad. NVIDIA released Nemotron 3 Ultra, an open 550B Mixture-of-Experts hybrid Mamba-Transformer for long-running agents, on June 4, and Google DeepMind released Gemma 4 12B on June 3. For the June 6-7 weekend specifically, the dominant story was MiniMax M3's open weights and community quantizations becoming widely runnable. We cover the June 3-4 window in a separate piece.
Keep reading
June 3-4, 2026: two open models, and the harness that decides if a long-running agent runs long
Gemma 4 12B and NVIDIA Nemotron 3 Ultra, read through the layer that feeds them context.
June 10-11, 2026 AI model releases, papers, and open source
The next dated window in the same series.
Routing Claude Code through a custom ANTHROPIC_BASE_URL endpoint
How the Anthropic base URL override works, and why it lets the same workflow point at a local or proxied model.
Fazm is open source at github.com/mediar-ai/fazm. Dated release windows are noisy; treat any static list as a snapshot and check the live model feeds for the current state.
Comments (••)
Leave a comment to see what others are saying.Public and anonymous. No signup.