Open Source AI Model Release Roundup: April 2026

Matthew Diakonov · 10 min read


April 2026 has been one of the most active months for open source AI model releases in recent memory. We have tracked every significant model drop, from Meta's LLaMA 4 family to Alibaba's Qwen 3, so you can evaluate what to run locally, what to serve in production, and what to skip.

April 2026 Model Releases at a Glance

| Model | Organization | Parameters | Context Window | License | Release Date |
|---|---|---|---|---|---|
| LLaMA 4 Scout | Meta | 17B (16 experts) | 10M tokens | Llama 4 Community | April 5 |
| LLaMA 4 Maverick | Meta | 17B (128 experts) | 1M tokens | Llama 4 Community | April 5 |
| Qwen 3 235B | Alibaba | 235B (MoE) | 128K tokens | Apache 2.0 | April 8 |
| Qwen 3 32B | Alibaba | 32B (dense) | 128K tokens | Apache 2.0 | April 8 |
| Qwen 3 8B | Alibaba | 8B (dense) | 128K tokens | Apache 2.0 | April 8 |
| Mistral Medium 3 | Mistral AI | 73B | 128K tokens | Apache 2.0 | April 10 |
| DeepSeek-R2 Lite | DeepSeek | 8B | 64K tokens | MIT | April 7 |
| Gemma 3 27B | Google | 27B | 128K tokens | Gemma License | April 3 |
| Cohere Command A | Cohere | 111B (MoE) | 256K tokens | CC-BY-NC 4.0 | April 9 |
| Phi-4 Mini | Microsoft | 3.8B | 128K tokens | MIT | April 2 |

Meta's LLaMA 4 Family

Meta released two models in the LLaMA 4 series on April 5. Both use a mixture-of-experts (MoE) architecture, which means parameter counts are high but the active parameters per token stay manageable.

LLaMA 4 Scout (17B active, 16 experts)

Scout is the workhorse model. With 17B active parameters drawn from 16 expert modules, it fits on a single 48GB GPU (A6000 or better) and supports a staggering 10 million token context window. In practice, you will rarely need all 10M tokens, but the architecture handles long document retrieval and code repository analysis without the quality degradation you see in models stretched beyond their training distribution.

On MMLU-Pro, Scout scores 74.3, placing it ahead of Qwen 2.5 72B and competitive with Gemini 2.0 Flash. The real advantage is throughput: serving Scout with vLLM on two A100s yields roughly 1,800 tokens per second for batch inference.

LLaMA 4 Maverick (17B active, 128 experts)

Maverick uses the same 17B active parameters but draws from 128 experts instead of 16. The result: better benchmark scores (MMLU-Pro 78.8) at the cost of higher memory requirements. You need at minimum 256GB of combined GPU memory, which typically means four A100 80GB cards.

Maverick is the model to pick when accuracy matters more than serving cost. For most teams running inference on their own hardware, Scout is the practical choice.

[Diagram: LLaMA 4 MoE architecture. A batch of input tokens passes through a router that performs top-2 selection over the expert pool (Scout: N=16 experts, Maverick: N=128). The two selected expert outputs are summed to produce the layer output. Only 2 experts are active per token, so active parameters stay at 17B regardless of total expert count.]
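The top-2 routing that keeps active parameters constant across Scout and Maverick can be illustrated with a toy example. This is a minimal sketch of the general technique, not Meta's implementation; the expert functions and router logits below are invented for illustration:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, experts, router_logits):
    # Pick the two experts with the highest router scores for this token.
    probs = softmax(router_logits)
    top2 = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:2]
    # Renormalize the two selected weights so they sum to 1.
    w = [probs[i] for i in top2]
    s = sum(w)
    w = [x / s for x in w]
    # Weighted sum of the two expert outputs; the other N-2 experts do no work.
    return sum(wi * experts[i](token) for wi, i in zip(w, top2))

# Toy experts: expert i simply scales its input by (i + 1).
experts = [lambda x, k=k: x * (k + 1) for k in range(16)]
# Router strongly prefers experts 15 and 14 for this token.
out = moe_layer(2.0, experts, router_logits=[0.0] * 14 + [3.0, 5.0])
```

Whether the pool holds 16 or 128 experts, each token pays the compute cost of exactly two, which is why Maverick's quality gains come at a memory cost rather than a per-token FLOP cost.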

Alibaba's Qwen 3 Series

Alibaba released the full Qwen 3 lineup on April 8, spanning from 0.6B to 235B parameters. The standout feature across the entire series is hybrid thinking: every Qwen 3 model can switch between a "thinking" mode (chain-of-thought reasoning with internal scratchpad) and a "non-thinking" mode (direct response) based on a system prompt toggle.
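Assuming Qwen 3 keeps the `/think` and `/no_think` soft-switch convention used by recent Qwen chat templates (an assumption worth verifying against the model card), toggling the mode per request can be as simple as appending the switch to the user turn. The helper below is hypothetical, written only to illustrate the pattern:

```python
def qwen3_messages(user_prompt, thinking=True):
    # Hypothetical helper: append the soft-switch token that selects
    # thinking vs. non-thinking mode for this single turn.
    switch = "/think" if thinking else "/no_think"
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"{user_prompt} {switch}"},
    ]

msgs = qwen3_messages("Is 127 prime?", thinking=False)
```

The resulting message list can then be passed to whatever serving stack you use; the toggle lives entirely in the prompt, so no redeploy is needed to switch modes.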

Qwen 3 235B (MoE)

The flagship model uses 22B active parameters out of 235B total. On MMLU-Pro it scores 79.4, on GPQA Diamond 71.1, and on LiveCodeBench 70.7. These numbers put it in the same tier as GPT-4o and Claude Sonnet 4 on most reasoning benchmarks.

The Apache 2.0 license is the real story here. Unlike LLaMA 4's community license (which restricts use for companies with over 700 million monthly active users), Qwen 3 235B has no usage restrictions at all.

```bash
# Serve Qwen 3 235B with vLLM (requires ~120GB VRAM across GPUs)
vllm serve Qwen/Qwen3-235B-A22B \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --enable-prefix-caching
```

Qwen 3 32B and 8B (Dense)

For teams that want simpler deployment, the dense models skip the MoE routing entirely. Qwen 3 32B fits on a single A100 40GB and scores 72.9 on MMLU-Pro. Qwen 3 8B runs on consumer GPUs (16GB VRAM with Q4 quantization via llama.cpp) and still manages 62.1 on MMLU-Pro, beating many 70B models from 2024.

Tip

The Qwen 3 8B model with Q4_K_M quantization runs at roughly 45 tokens per second on an M4 MacBook Pro. For local development and testing, this is the sweet spot between quality and speed in April 2026.
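A very rough rule of thumb for sizing Q4 deployments: Q4_K_M averages about 4.5 bits (~0.56 bytes) per weight, plus an allowance for the KV cache and runtime buffers. The sketch below uses a flat 1.5GB overhead as an assumption; real requirements vary with context length and quantization variant:

```python
def q4_vram_estimate_gb(params_billions, overhead_gb=1.5):
    # Q4_K_M averages roughly 4.5 bits per weight (~0.56 bytes);
    # add a flat allowance for KV cache and runtime buffers.
    bytes_per_param = 4.5 / 8
    return params_billions * bytes_per_param + overhead_gb

estimate = q4_vram_estimate_gb(8)  # Qwen 3 8B
```

For the 8B model this lands at about 6GB, in line with the hardware table below; for larger models the flat overhead term becomes a worse approximation, so treat this as a first-pass filter rather than a capacity plan.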

Mistral Medium 3

Mistral AI released Medium 3 on April 10 under an Apache 2.0 license. At 73B parameters (dense), it sits between the previous Mistral Large 2 and the smaller Mistral Small models. The headline feature is a 128K token context window with strong retrieval accuracy out to approximately 96K tokens (based on the RULER benchmark).

Medium 3 scores 77.0 on MMLU-Pro and 68.5 on GPQA Diamond. What makes it interesting for production use is the balance: it is small enough to serve on two A100 40GB GPUs while scoring within a few points of much larger models.

DeepSeek-R2 Lite

DeepSeek quietly released R2 Lite on April 7 under an MIT license. At 8B parameters, it is a reasoning-focused model trained specifically on math, coding, and logical deduction tasks. On GSM8K it scores 89.2 and on HumanEval it reaches 82.3.

The MIT license and small size make it ideal for embedding reasoning capabilities into applications where you control the hardware. The model uses a "think before answering" approach similar to Qwen 3's thinking mode, but the reasoning trace is always visible in the output.
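Since the reasoning trace is always present in the output, applications usually want to separate it from the final answer before display or logging. Assuming R2 Lite wraps the trace in `<think>...</think>` tags, as DeepSeek's R1 models do (an assumption; check the model card), a minimal splitter looks like this:

```python
import re

def split_reasoning(output):
    # Separate the visible reasoning trace from the final answer,
    # assuming a DeepSeek-R1-style <think>...</think> wrapper.
    m = re.search(r"<think>(.*?)</think>\s*(.*)", output, re.DOTALL)
    if not m:
        return "", output.strip()
    return m.group(1).strip(), m.group(2).strip()

trace, answer = split_reasoning(
    "<think>17 * 3 = 51, so the total is 51.</think>The answer is 51."
)
```

Keeping the trace around (rather than discarding it) is useful for debugging wrong answers: the error is almost always visible in the reasoning steps.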

Google Gemma 3 27B

Google's Gemma 3 27B arrived on April 3 with multimodal support (text and image input). It is the first open-weight Gemma model that handles vision tasks, scoring 74.1 on MMMU and 62.8 on MathVista. For text-only benchmarks it is competitive with Qwen 2.5 72B despite being roughly one-third the size.

The main limitation: Gemma's license restricts redistribution of fine-tuned derivatives in some commercial contexts. Read the Gemma license carefully if you plan to fine-tune and distribute.

Hardware Requirements Comparison

| Model | Min VRAM (FP16) | Min VRAM (Q4) | Recommended Setup | Tokens/sec (est.) |
|---|---|---|---|---|
| LLaMA 4 Scout | 48GB | 24GB | 1x A100 80GB | ~1,800 (batch) |
| LLaMA 4 Maverick | 256GB | 96GB | 4x A100 80GB | ~900 (batch) |
| Qwen 3 235B | 120GB | 60GB | 4x A100 40GB | ~1,200 (batch) |
| Qwen 3 32B | 40GB | 16GB | 1x A100 40GB | ~2,500 (batch) |
| Qwen 3 8B | 16GB | 6GB | M4 MacBook Pro | ~45 (single) |
| Mistral Medium 3 | 80GB | 40GB | 2x A100 40GB | ~1,400 (batch) |
| DeepSeek-R2 Lite | 16GB | 6GB | Consumer GPU | ~50 (single) |
| Gemma 3 27B | 32GB | 14GB | 1x A6000 | ~60 (single) |

Common Pitfalls When Adopting New Models

  • Tokenizer mismatches. Qwen 3 uses a different tokenizer than Qwen 2.5. If you fine-tuned on Qwen 2.5, do not load those LoRA weights onto Qwen 3 without retraining. The vocabulary shifted enough to produce garbage output.

  • MoE memory spikes. LLaMA 4 Scout's 16 experts all need to be resident in memory even though only 2 are active per token. If you try to offload inactive experts to CPU, latency spikes to 3-5x, making it impractical for real-time serving.

  • License confusion. "Open source" and "open weights" are not the same. LLaMA 4's community license, Gemma's restricted license, and Cohere Command A's CC-BY-NC all have different commercial implications. Only Apache 2.0 and MIT models (Qwen 3, Mistral Medium 3, DeepSeek-R2 Lite, Phi-4 Mini) are fully permissive.

  • Context window vs. effective context. A model advertising 128K context does not guarantee equal quality at all positions. Test retrieval accuracy at 25%, 50%, 75%, and 100% of the context window before committing to long-context workflows.
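The last pitfall is easy to test mechanically: place a "needle" sentence at several fractional depths of a long filler context and query the model for it at each depth. The helper below is a hypothetical sketch of the prompt-construction half of such a test; the filler text and needle are invented:

```python
def insert_needle(haystack, needle, depth):
    # Place the needle sentence at a fractional depth (0.0-1.0) of the
    # context, splitting on a word boundary to keep the text readable.
    words = haystack.split()
    pos = int(len(words) * depth)
    return " ".join(words[:pos] + [needle] + words[pos:])

filler = "lorem ipsum " * 1000
prompts = [
    insert_needle(filler, "The vault code is 4417.", d)
    for d in (0.25, 0.5, 0.75, 1.0)
]
```

Send each prompt with the question "What is the vault code?" and record accuracy per depth; a model whose accuracy collapses past 50% depth has a much smaller effective context than its advertised window.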

Warning

Several models released in early April 2026 initially shipped with incorrect chat templates in their Hugging Face configs. Always check the model card's "Known Issues" section before deploying, and pin to a specific commit hash rather than pulling latest.

Which Model Should You Pick?

The answer depends on your constraints:

  • Best overall quality (no budget constraints): Qwen 3 235B or LLaMA 4 Maverick
  • Best quality per dollar: LLaMA 4 Scout or Mistral Medium 3
  • Best for local development: Qwen 3 8B (Apple Silicon) or DeepSeek-R2 Lite (reasoning tasks)
  • Best permissive license: Qwen 3 series (Apache 2.0) or Mistral Medium 3 (Apache 2.0)
  • Best for multimodal: Gemma 3 27B (only open model with vision this month)

Wrapping Up

April 2026 marks a turning point for open source AI models. The gap between proprietary and open-weight models has narrowed to single-digit percentage points on most benchmarks. For teams building AI-powered workflows, the practical question is no longer "are open models good enough?" but "which open model fits our hardware and licensing constraints?"

Fazm automates your desktop workflows with local AI agents that can use any of these models. Learn more or check out the source on GitHub.
