New AI model releases, papers, and open source: June 2-3, 2026

Two storylines hit this two-day window: Microsoft shipped two closed models on the 2nd, Google open-sourced one on the 3rd. Every other writeup stops at the benchmark line. The part worth your time is what the one open release actually needs from your machine to be useful, and why that part lives in your agent loop, not in the weights.

M
Matthew Diakonov
7 min read
Direct answer, verified June 16, 2026

In the June 2-3, 2026 window, three named models shipped. On June 2, at Build 2026, Microsoft announced MAI-Thinking-1 (reasoning) and MAI-Code-1-Flash (coding), both closed weights. On June 3, Google DeepMind released Gemma 4 12B, an open-weight encoder-free multimodal model under Apache 2.0. For the exact set on any given day, the registry is the live feed, not an article: Hugging Face new models and the llm-stats updates log.

What shipped, at a glance

Three named models across two days, split clean down the open/closed line. The dates are the announcement dates from each vendor.

DateModelWhat it isWeights
June 2MAI-Thinking-1
Microsoft
Reasoning (sparse MoE)Closed
June 2MAI-Code-1-Flash
Microsoft
Coding (Copilot)Closed
June 3Gemma 4 12B
Google DeepMind
Multimodal (text/image/audio/video)Open

Two storylines, one two-day window

These two days are a tidy illustration of how releases actually arrive. The closed models came down a keynote-and-product channel. The open model came down a download-and-blog channel. They never sit in the same place, which is why a single dated list is always a photograph of a moving thing.

June 2: Microsoft's two closed MAI models

At Build 2026 Microsoft announced MAI-Thinking-1, its first in-house reasoning model (a sparse MoE, ~35B active params, 256K context, private preview on Foundry), and MAI-Code-1-Flash, an efficient coding model rolling out inside GitHub Copilot. Both trained on commercially licensed data, both closed weights. You consume them as a hosted service, not a download.

June 3: Google's open Gemma 4 12B

A dense 12B decoder-only model that takes text, image, audio, and video with no separate encoder, under Apache 2.0. Runs on a 16 GB laptop. This is the one you can actually pull down and serve yourself.

The axis that matters for a desktop agent

Open vs closed decides whether you can self-host. But whether either kind helps your day depends on the agent loop you run it in: what tools it has, whether it can reach past the terminal, and whether it keeps your full context across a long session.

Where each kind surfaced

The open weight landed on Hugging Face and the developer blog; the closed pair landed in a Build keynote and a Copilot rollout. Same two-day window, two completely different distribution channels, which is exactly why no single list captures a date.

The one you can actually pull down: Gemma 4 12B

Gemma 4 12B is the release in this window that matters if you want to run something yourself. It is a single dense 12-billion-parameter decoder-only transformer that handles text, images, audio, and video natively, and the interesting design choice is that it drops the separate encoders most multimodal models carry. Per the June 3 launch coverage, vision and audio flow straight into the LLM backbone: a small 35M vision embedder does a single matrix multiplication per image patch, and raw 16 kHz audio is sliced into 40 ms frames and linearly projected into the same embedding space as text tokens. No 550M vision encoder, no 300M audio encoder bolted on the side.

0B
Dense parameters
0 GB
RAM to run locally
0M
Vision embedder
0 ms
Audio frame size

What the June 3 launch coverage actually claims

  • Open weights, publicly downloadable, Apache 2.0 license
  • Dense decoder-only, not a mixture-of-experts
  • Native text, image, audio, and video with no separate encoder
  • First mid-sized Gemma to take audio input
  • Runs on a 16 GB laptop (VRAM or unified memory)
  • Full benchmark tables were not in the initial launch materials

Specs per the June 3 release writeup and the Gemma developer guide. Google reports it performs near the 26B mixture-of-experts model at under half the memory footprint, but did not publish the full tables on day one, so treat the headline comparison as a vendor claim until independent numbers land.

An encoder-free model reads pixels. On a Mac, that is the easy half.

Here is the part no release roundup connects. Gemma 4 12B can take an image of your screen and reason about it. That sounds like the whole game for a desktop agent, but it is only half. A raw screenshot tells a model what things look like; it does not reliably tell it which rectangle is a clickable button, what that button is labeled, or where to click so the click actually lands. Pixels are lossy about exactly the structure an agent needs to act. On macOS that structure already exists in the accessibility tree, and the harness that hands it to the model is what decides whether a multimodal release helps your day.

Verifiable in the source

In fazm the agent does not get one or the other; it gets both. In Desktop/Sources/Providers/ChatToolExecutor.swift there is a capture_screenshot tool (the executor dispatches it at executeCaptureScreenshot, gated on Screen Recording permission), and alongside it the same file wires an accessibility permission path that flips appState.hasAccessibilityPermission so the agent can act on real UI elements rather than guessing pixel coordinates off the screenshot. So the moment you test a model like Gemma 4 12B in fazm, it is being fed a real screen image and the accessibility structure under it.

That distinction is exactly why the model layer changing on June 3 does not automatically change your outcomes. A better vision model with a screenshot-only harness still flails on a click target it cannot localize. The same model with an accessibility-grounded harness can read the labeled element directly. The release gives you a sharper eye; the harness decides whether the hand can reach.

How to decide if a same-week release is worth keeping

A playground tells you almost nothing about whether a model is useful for your work. A model that looks sharp on a clean prompt can fall apart the moment it carries your real files, your constraints, and the dead ends you already ruled out. The only honest test is to put it behind the same agent loop, with the same tools and the same running context, on the surface you actually work on.

Two fazm properties make that test hold over the days it takes to form a real opinion. The agent loop reaches past the terminal: through macOS accessibility APIs and the screenshot tool above, it drives your actual browser, native Mac apps, and Google Workspace, so you evaluate a model on your full surface rather than a code-only sandbox. And fazm does not auto-compact; the full chat history stays live in context for the lifetime of the window, and sessions survive a Mac restart with every window auto-restored. A trial of a June 3 open weight can span a working week without the comparison quietly drifting because a harness silently dropped earlier turns. It is fully local and open source, so nothing about the screen or mic you feed a local model leaves your machine.

Want to put a June release through your own Mac workflow?

Book a short call and I will walk through serving a fresh open weight locally and testing it inside an agent loop that reaches your browser and native apps, not just a terminal.

Questions about the June 2-3, 2026 releases

What new AI models released on June 2-3, 2026?

Two dated windows, three models. On June 2 at Build 2026 Microsoft announced MAI-Thinking-1 (its first in-house reasoning model, a sparse mixture-of-experts with roughly 35 billion active parameters and a 256K context window, in private preview on Microsoft Foundry) and MAI-Code-1-Flash (an inference-efficient coding model that began rolling out across GitHub Copilot tiers). Both are closed weights. On June 3 Google DeepMind released Gemma 4 12B, an open-weight, encoder-free multimodal model under Apache 2.0. The continuous open-weight and preprint stream kept moving on both days, so the only fully current view is the live feeds, not a static list.

Is Gemma 4 12B open source, and under what license?

The weights are open and publicly downloadable under the Apache 2.0 license, per the June 3, 2026 release coverage. Microsoft's two MAI models from June 2 are not open: MAI-Thinking-1 shipped in private preview on Foundry and MAI-Code-1-Flash rolled out inside GitHub Copilot, neither with downloadable weights.

What is actually new about Gemma 4 12B?

It is a dense 12-billion-parameter decoder-only transformer that processes text, images, audio, and video natively, with no separate vision or audio encoder. The launch writeup describes vision and audio flowing straight into the LLM backbone: a 35M vision embedder doing a single matrix multiplication per image patch, and raw 16 kHz audio sliced into 40 ms frames linearly projected into the same embedding space as text tokens. It is the first mid-sized Gemma with native audio input, it runs on a 16 GB laptop, and Google says it performs near the 26B mixture-of-experts model at under half the memory footprint (full benchmark tables were not in the initial launch materials).

Were Microsoft's June 2 MAI models open source?

No. Both were closed releases announced at Build 2026. MAI-Thinking-1 went to private preview on Microsoft Foundry, and MAI-Code-1-Flash shipped as a hosted coding model inside GitHub Copilot across the Free, Pro, Pro+, and Max tiers. Closed-weight launches are announced by the vendor's own newsroom, which is why they show up in a different place from open weights on Hugging Face.

Can Gemma 4 12B run on my laptop?

Yes, that is the headline. It is designed to run locally with roughly 16 GB of VRAM or unified memory, so a regular laptop with 16 GB can serve it. Being a single dense model rather than a mixture-of-experts is part of why the memory footprint stays in laptop range while keeping all four modalities.

Why does an encoder-free vision model not automatically make a desktop agent more useful?

Because feeding it screen context well is a harness problem, not a weights problem. A model that can read an image still needs the right image, and pixels alone do not tell it which element is a button, what its label is, or where to click reliably. On a Mac, the accessibility tree carries that structure directly. The model layer changed on June 3; whether it helps you depends on what your agent loop hands it.

How do I test a fresh open model like Gemma 4 12B inside a real Mac workflow?

Run it locally, then drive your real agent loop with it rather than poking at it in a playground. In fazm the agent loop reaches past the terminal through macOS accessibility APIs and a real screenshot tool, so you evaluate the model on the full surface you actually work on (browser, native Mac apps, Google Workspace), not a code-only sandbox. Sessions survive a restart and nothing auto-compacts, so a multi-day trial of a same-week release does not quietly drift because the harness dropped earlier turns.

How did this page land for you?

React to reveal totals

Comments ()

Leave a comment to see what others are saying.

Public and anonymous. No signup.