Open Source Large Language Model News April 2026: Timeline, Benchmarks, and What Changed

Matthew Diakonov · 13 min read


April 2026 delivered more open source large language model news in two weeks than most quarters produce in total. Seven production-ready models shipped from six different organizations, mixture-of-experts became the dominant architecture, and the regulatory picture in Europe finally sharpened. This is a comprehensive rundown of every major development, with benchmarks, hardware requirements, and licensing details that matter for production use.

Full Timeline of April 2026 Releases

| Date | Model | Organization | Parameters (Total / Active) | Architecture | License |
|---|---|---|---|---|---|
| Apr 2 | Llama 4 Scout hits 1M downloads | Meta | 109B / 17B | MoE (16 experts) | Llama 4 Community |
| Apr 3 | OLMo 2 32B | Ai2 | 32B / 32B | Dense | Apache 2.0 |
| Apr 5 | Llama 4 Maverick | Meta | 400B / 17B | MoE (128 experts) | Llama 4 Community |
| Apr 5 | Qwen 3 72B | Alibaba | 72B / 72B | Dense | Apache 2.0 |
| Apr 7 | DeepSeek V3 paper published | DeepSeek | 671B / 50B | MoE (fine-grained) | Research |
| Apr 8 | Qwen 3 MoE 235B | Alibaba | 235B / 22B | MoE | Apache 2.0 |
| Apr 8 | Codestral 2 | Mistral | 22B / 22B | Dense | Apache 2.0 |
| Apr 9 | Gemma 3n | Google | 4B effective / 2B footprint | Dense multimodal | Gemma License |
| Apr 10 | EU AI Act open source exemption guidance | European Commission | N/A | N/A | N/A |
| Apr 11 | Gemma 3 9B commercial license | Google | 9B / 9B | Dense | Gemma License (updated) |

Benchmark Comparison Across Models

Raw benchmark numbers should always be taken with context, but they help establish a baseline when comparing open source large language model options. Here is where the April 2026 releases stand on three widely used evaluations.

| Model | MMLU-Pro | HumanEval | MATH-500 | Min VRAM (quantized) |
|---|---|---|---|---|
| Qwen 3 72B | 89.1 | 86.7 | 83.2 | 48GB (4-bit GPTQ) |
| Qwen 3 MoE 235B | 88.4 | 85.9 | 82.7 | 32GB (22B active) |
| Llama 4 Maverick | 87.9 | 84.2 | 81.5 | 32GB (17B active) |
| Llama 4 Scout | 85.3 | 81.6 | 78.4 | 24GB (4-bit GGUF) |
| OLMo 2 32B | 79.8 | 73.5 | 71.2 | 24GB (4-bit) |
| Codestral 2 22B | 76.2 | 88.1 | 68.9 | 16GB (4-bit) |
| Gemma 3 9B | 74.5 | 70.2 | 67.8 | 8GB (4-bit) |
| Gemma 3n (2B footprint) | 61.3 | 52.4 | 49.1 | 2GB |

Important context

Benchmarks from model authors tend to use best-case evaluation settings. Community reproductions frequently report 2 to 5 points lower on reasoning tasks, especially for quantized versions. Always validate on your own data before switching production models.

Architecture Shift: Why Mixture of Experts Dominated

The single biggest theme in open source large language model news this month is the shift to mixture-of-experts architectures. Four of the seven major releases use MoE, and even the dense models are now benchmarked against MoE equivalents rather than other dense models.

*Figure: Dense vs. MoE — how April 2026 models compare. MMLU-Pro score (60–90) plotted against active parameters (2B–72B) for Gemma 3n, Scout, Maverick, Qwen 3 MoE, Codestral 2, OLMo 2, DeepSeek V3, and Qwen 3 72B; MoE models are plotted by active parameter count.*

The practical consequence: MoE models deliver higher quality at lower inference cost, because only a fraction of parameters activate per token. Llama 4 Scout uses 17B of its 109B total parameters on each forward pass. You get performance that competes with dense 30B to 50B models, but the compute bill looks like a 17B model's.

The tradeoff is memory. You still need to load all 109B parameters into VRAM (or system RAM for CPU inference), even though most sit idle. This makes MoE models cheaper to run per token but more expensive to host per instance than a dense model of equivalent quality.
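The compute-versus-memory tradeoff above can be sketched with back-of-the-envelope arithmetic. This uses the article's parameter counts and the standard approximation of roughly 2 FLOPs per active parameter per token; the 40B dense comparison point is a hypothetical, not a specific model.

```python
# Rough MoE vs. dense cost comparison (illustrative, not measured).

def per_token_flops(active_params_b: float) -> float:
    """Approximate forward-pass FLOPs per token: ~2 FLOPs per active parameter."""
    return 2 * active_params_b * 1e9

def load_memory_gb(total_params_b: float, bytes_per_param: float = 2.0) -> float:
    """Memory to hold the full weights (all experts), e.g. 2 bytes for FP16/BF16."""
    return total_params_b * 1e9 * bytes_per_param / 1e9

# Llama 4 Scout: 109B total parameters, 17B active per token.
scout_compute = per_token_flops(17)
scout_memory = load_memory_gb(109)

# A hypothetical dense model of comparable quality: 40B total and active.
dense_compute = per_token_flops(40)
dense_memory = load_memory_gb(40)

print(f"Scout:     {scout_compute/1e9:.0f} GFLOPs/token, {scout_memory:.0f} GB to load")
print(f"Dense 40B: {dense_compute/1e9:.0f} GFLOPs/token, {dense_memory:.0f} GB to load")
```

The numbers make the hosting asymmetry concrete: Scout does less than half the per-token compute of the dense 40B model, but needs well over twice the memory to hold its idle experts.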

Licensing Landscape: Who Can Use What

Licensing is the most overlooked aspect of open source large language model news. A model that performs well on benchmarks but carries a restrictive license may not be usable for your project.

| Model | License | Commercial Use | Fine-tuning | Redistribution | Key Restriction |
|---|---|---|---|---|---|
| Qwen 3 (all sizes) | Apache 2.0 | Yes | Yes | Yes | None |
| OLMo 2 32B | Apache 2.0 | Yes | Yes | Yes | None |
| Codestral 2 | Apache 2.0 | Yes | Yes | Yes | None |
| Llama 4 Scout/Maverick | Llama 4 Community | Yes (under 700M MAU) | Yes | With conditions | Monthly active user cap |
| Gemma 3 9B / 3n | Gemma License | Yes | Yes | With conditions | Responsible use restrictions |
| DeepSeek V3 | Custom Research | Limited | Case by case | No | Research and evaluation only |

Apache 2.0 now covers three of the top-performing open source large language models released this month. This is a meaningful shift from even six months ago, when the best models almost always came with non-commercial or custom licenses.

Qwen 3: The Full Story

Alibaba released the entire Qwen 3 family on April 5 and 8, spanning eight model sizes from 600 million parameters to 235 billion. The headline result is the 72B dense variant scoring 89.1 on MMLU-Pro, which edges past GPT-4o's 88.7 on the same evaluation.

What makes Qwen 3 genuinely different from prior Qwen releases is the "thinking mode" toggle. You can prompt the model to show its chain-of-thought reasoning or suppress it for faster responses. This is configurable per request, not baked into separate model variants.
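A per-request toggle is typically passed through the chat template when serving behind an OpenAI-compatible endpoint. The sketch below assumes a vLLM-style server and an `enable_thinking` template flag following Qwen's published template convention; treat the exact parameter name and temperature values as assumptions to verify against the model card.

```python
# Sketch: toggling Qwen 3's "thinking mode" per request via an
# OpenAI-compatible chat API. `enable_thinking` and the temperature
# values are assumptions based on Qwen's template conventions.

def build_chat_request(prompt: str, thinking: bool) -> dict:
    """Build a request body for a vLLM-style OpenAI-compatible server."""
    return {
        "model": "Qwen3-72B",
        "messages": [{"role": "user", "content": prompt}],
        # Forwarded to the chat template on the server side.
        "chat_template_kwargs": {"enable_thinking": thinking},
        # A slightly lower temperature is often used with reasoning on.
        "temperature": 0.6 if thinking else 0.7,
    }

fast = build_chat_request("Summarize this contract clause.", thinking=False)
deep = build_chat_request("Prove the bound holds for n > 2.", thinking=True)
```

Because the toggle lives in the request rather than the checkpoint, one deployment can serve both latency-sensitive and reasoning-heavy traffic.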

The multilingual capabilities are also noteworthy. Qwen 3 handles 119 languages with less quality degradation than competing models, particularly for CJK (Chinese, Japanese, Korean) and Arabic text. Teams building international products should evaluate the 72B and 235B MoE variants specifically for this strength.

Llama 4: Adoption and Limitations

Meta's Llama 4 family generated the most open source large language model news by volume this month. Scout, the smaller MoE variant, reached one million Hugging Face downloads in four days, faster than any previous open-weight release; for comparison, Llama 3 70B took eleven days to hit the same milestone.

Scout's appeal is practical: 17B active parameters in a 109B MoE architecture means it runs on a single 48GB GPU while outperforming many dense models twice its active size. Meta also shipped GGUF quantizations on launch day, which gave the llama.cpp community immediate access.

Maverick, the larger variant with 128 experts and 400B total parameters, offers better reasoning but requires multi-GPU setups for inference. Early adopters report two A100 80GB cards as the minimum comfortable configuration.

The limitations worth noting: Llama 4's license caps commercial use at 700 million monthly active users (relevant mainly for very large deployments), and Meta has not released the training data or training code.

EU AI Act: What the Open Source Exemption Means

The European Commission published implementation guidance on April 10 that resolves months of ambiguity around how open source large language models interact with AI Act compliance requirements.

Three key points:

  1. Sub-10B parameter threshold. Open source large language models under 10 billion parameters released under recognized open source licenses are exempt from the "general-purpose AI" provider obligations. This includes the training data transparency requirements that had worried smaller labs.

  2. Lighter path for larger models. Models above 10B parameters can still qualify for reduced compliance obligations if they meet the open source definition, but they must provide detailed model cards documenting training methodology and evaluation results.

  3. Deployer obligations remain. Companies deploying these models in production carry their own compliance responsibilities regardless of the base model's license or open source status.

For builders, the practical impact is that models like Gemma 3 9B, Gemma 3n, and smaller Qwen 3 variants (0.6B through 8B) can be deployed in EU markets with minimal compliance overhead.
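The three points above reduce to a simple triage, sketched below. This is a planning aid, not legal advice: the threshold and tiers paraphrase the article's summary of the April 10 guidance, and the set of "recognized open source licenses" is an illustrative assumption (whether a given license qualifies is a question for counsel).

```python
# Illustrative EU AI Act triage for open-weight models. Thresholds and
# tier names paraphrase the April 10 guidance summary; the license set
# is an assumption, not an official list. Not legal advice.

RECOGNIZED_OPEN_LICENSES = {"Apache 2.0", "MIT", "BSD-3-Clause"}

def aia_tier(params_b: float, license_name: str) -> str:
    """Classify a base model's provider-side obligations.

    Deployer obligations apply regardless of the result (point 3 above).
    """
    if license_name not in RECOGNIZED_OPEN_LICENSES:
        return "full GPAI provider obligations"
    if params_b < 10:
        return "exempt from GPAI provider obligations"
    return "reduced obligations (detailed model card required)"

print(aia_tier(8, "Apache 2.0"))    # small open model: exempt
print(aia_tier(72, "Apache 2.0"))   # large open model: reduced path
print(aia_tier(671, "Custom Research"))
```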

DeepSeek V3: Architecture Innovation

DeepSeek published the full technical paper for their V3 architecture on April 7. The model uses 671 billion total parameters with only 50 billion active per token, achieving performance comparable to dense 200B+ parameter models at roughly a quarter of the inference cost.

The core innovation, which DeepSeek calls "fine-grained expert segmentation," splits traditional MoE experts into smaller sub-experts. Instead of routing each token to 2 of 8 large experts, the model routes to 8 of 64 small experts. This allows more precise activation patterns that better match the requirements of each input.
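The routing change above can be illustrated with a minimal top-k router. This is a toy sketch of the described scheme, not DeepSeek's implementation: real routers add load-balancing losses, capacity limits, and shared experts, all omitted here, and the hidden size is arbitrary.

```python
import numpy as np

# Toy sketch of fine-grained expert routing: top-8 of 64 small experts
# versus the classic top-2 of 8 large experts. Illustrative only.

rng = np.random.default_rng(0)

def route(hidden: np.ndarray, gate: np.ndarray, k: int):
    """Return indices and normalized weights of the top-k experts for one token."""
    logits = hidden @ gate                       # (num_experts,) routing scores
    topk = np.argsort(logits)[-k:]               # indices of the k highest scores
    weights = np.exp(logits[topk] - logits[topk].max())
    return topk, weights / weights.sum()         # softmax over selected experts

hidden = rng.normal(size=512)                    # one token's hidden state
coarse_gate = rng.normal(size=(512, 8))          # classic MoE: 8 large experts
fine_gate = rng.normal(size=(512, 64))           # fine-grained: 64 sub-experts

coarse_idx, _ = route(hidden, coarse_gate, k=2)  # 2-of-8 routing
fine_idx, fine_w = route(hidden, fine_gate, k=8) # 8-of-64 routing

print(len(coarse_idx), len(fine_idx))  # 2 8
```

With the same total activated capacity, the fine-grained router has far more possible expert combinations per token, which is what "more precise activation patterns" refers to.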

The paper also details a curriculum learning approach for training: start with shorter sequences (2K tokens), gradually increase to the full context length (128K tokens), and adjust the learning rate schedule at each transition. This recipe is reproducible with existing frameworks like Megatron-LM and DeepSpeed.
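A sequence-length curriculum of this shape can be expressed as a simple staged schedule. The stage boundaries, step fractions, and the halving of the learning rate at each transition below are illustrative assumptions; the paper's 2K starting length and 128K final length are the only values taken from the text.

```python
# Sketch of a staged sequence-length curriculum: start at 2K tokens,
# step up to 128K, and cut the learning rate at each transition.
# Stage fractions and the 0.5 LR multiplier are assumptions.

STAGES = [            # (sequence length, fraction of total training steps)
    (2_048, 0.70),
    (8_192, 0.15),
    (32_768, 0.10),
    (131_072, 0.05),
]

def stage_at(progress: float):
    """Return (seq_len, lr_scale) for training progress in [0, 1)."""
    cumulative, lr_scale = 0.0, 1.0
    for seq_len, frac in STAGES:
        cumulative += frac
        if progress < cumulative:
            return seq_len, lr_scale
        lr_scale *= 0.5          # halve the LR at each transition (assumed)
    return STAGES[-1][0], lr_scale

print(stage_at(0.5))   # (2048, 1.0)
print(stage_at(0.99))  # (131072, 0.125)
```

In practice a schedule like this plugs into the data loader (to pick sequence length) and the optimizer (to scale the base learning rate) of frameworks such as Megatron-LM or DeepSpeed.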

On-Device Models: Gemma 3n Changes the Equation

Google's Gemma 3n, released April 9, fits a multimodal model (text, images, audio, video) into a 2GB memory footprint. It achieves this through a technique called "per-layer embedding," where audio and visual inputs are processed by shared layers rather than dedicated encoders, reducing parameter count without sacrificing modality coverage.

This matters for edge deployment. Running a capable multimodal model on a smartphone without a cloud connection was not realistic before this month. Gemma 3n makes it possible, with some quality tradeoffs compared to larger cloud models.

Practical note

Gemma 3n's audio and video processing is functional but not production-grade for complex tasks. It works well for basic captioning, classification, and simple question answering about media content. For detailed analysis, larger models remain necessary.

What to Watch in the Coming Weeks

| Development | Expected Timing | Why It Matters |
|---|---|---|
| vLLM 0.7 with MoE serving improvements | Late April | 2x throughput for Scout and Qwen 3 MoE |
| Yi-Large 2 from 01.AI | Early May | New competitor in the 70B+ tier |
| NVIDIA TensorRT-LLM MoE optimizations | Rolling | Hardware-specific speedups for MoE inference |
| Additional EU AI Act fine-tuning guidance | Q2 2026 | Clarifies obligations for teams that modify base models |
| Potential Llama 4 Behemoth (2T parameter) preview | TBD | Would be the largest open-weight model ever released |

Key Takeaways for Builders

If you are tracking open source large language model news to make production decisions, here is what April 2026 boils down to:

For general-purpose tasks: Qwen 3 72B offers the best benchmark performance under a fully permissive Apache 2.0 license. If you can afford the hardware (two A100 80GB cards or aggressive quantization on one 48GB card), it is the strongest option with no licensing constraints.

For cost-sensitive deployment: Llama 4 Scout gives you 85+ MMLU-Pro performance with only 17B active parameters. The Llama license is permissive enough for most commercial use cases.

For code generation: Codestral 2 at 22B parameters under Apache 2.0 is the best code-specific option. Its multi-file context handling outperforms larger general-purpose models on repository-level tasks.

For edge and mobile: Gemma 3n is the only viable option for on-device multimodal inference with a 2GB footprint.

For full transparency: OLMo 2 32B from Ai2 is the only model this month that published complete training data alongside the weights. If reproducibility and auditability matter for your use case, this is the choice.

```bash
# Quick start: try the top models locally

# Qwen 3 72B (needs 48GB+ VRAM with 4-bit quantization)
huggingface-cli download Qwen/Qwen3-72B-GPTQ-Int4 --local-dir ./qwen3-72b

# Llama 4 Scout (fits 24GB VRAM with 4-bit GGUF)
huggingface-cli download meta-llama/Llama-4-Scout-17B-16E-Instruct-GGUF \
  --local-dir ./llama4-scout

# Gemma 3n (runs on phones and laptops)
huggingface-cli download google/gemma-3n-E2B-it --local-dir ./gemma-3n
```

Fazm provides verified execution logs for AI agent pipelines. When you deploy open source large language models in production agent workflows, Fazm records what each agent did so you can debug failures and prove compliance.
