New Open Source LLM Releases in April 2026: What Just Dropped and How to Run Them

Matthew Diakonov · 13 min read


The first twelve days of April 2026 produced more open source LLM releases than most entire quarters. If you blinked, you missed three or four model drops. This post is a practical rundown of every new open source model that shipped this month, organized by release date, with the exact commands you need to download and run each one locally.

Release Timeline: What Shipped and When

April 2026 open source LLM release timeline: OLMo 2 32B (Ai2, Apr 3) · Llama 4 (Meta, Apr 5) · Command A (Cohere, Apr 7) · Qwen 3 (Alibaba, Apr 8) · Gemma 3n (Google, Apr 9) · Phi-4-reasoning (Microsoft, Apr 10)

Six organizations shipped new open source models within a single eight-day window, from April 3 to April 10. Here is what each one brings to the table and how to actually use them.

Complete List of New Releases

| Model | Org | Release Date | Params | Active Params | License | VRAM (Q4) | Download |
|---|---|---|---|---|---|---|---|
| OLMo 2 32B | Ai2 | Apr 3 | 32B | 32B (dense) | Apache 2.0 | ~20 GB | HuggingFace |
| Llama 4 Scout | Meta | Apr 5 | 109B MoE | 17B | Llama 4 Community | ~24 GB | HuggingFace |
| Llama 4 Maverick | Meta | Apr 5 | 400B MoE | 17B | Llama 4 Community | ~80 GB | HuggingFace |
| Command A | Cohere | Apr 7 | 111B MoE | 11B | CC-BY-NC | ~18 GB | HuggingFace |
| Qwen 3 (dense) | Alibaba | Apr 8 | 0.6B to 72B | Full | Apache 2.0 | 0.5 to 44 GB | HuggingFace |
| Qwen 3 MoE | Alibaba | Apr 8 | 235B MoE | 22B | Apache 2.0 | ~32 GB | HuggingFace |
| Gemma 3n | Google | Apr 9 | 2B / 4B eff. | Full | Gemma | ~3 GB | Kaggle |
| Phi-4-reasoning | Microsoft | Apr 10 | 14B | 14B (dense) | MIT | ~9 GB | HuggingFace |

Note

VRAM estimates assume 4-bit quantization (Q4_K_M or equivalent). Dense models use their full parameter count on every forward pass. MoE models only activate a subset, so their VRAM requirement is lower than the total parameter count suggests, though you still need to load all parameters into memory.
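The estimates above follow from a simple rule of thumb: total parameter count times bits per weight, plus some headroom for the KV cache and activations. A minimal sketch (the 1.2x overhead factor and 4.5 bits per weight for Q4_K_M are assumptions, not exact figures):

```python
def estimate_vram_gb(total_params_b: float, bits_per_weight: float = 4.5,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB for a quantized model.

    total_params_b: total parameters in billions. Use TOTAL params for
    MoE models -- every expert must be resident in memory even though
    only a few run per token.
    bits_per_weight: ~4.5 for Q4_K_M (4-bit weights plus scale factors).
    overhead: multiplier covering KV cache and activation buffers.
    """
    weight_bytes = total_params_b * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# OLMo 2 32B: close to the ~20 GB figure in the table above
print(round(estimate_vram_gb(32), 1))  # → 21.6
```

Real usage varies with context length and framework, so treat this as a planning floor, not a guarantee.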

Llama 4 Scout and Maverick: The Headline Releases

Meta dropped two models on April 5, both using Mixture of Experts. The naming signals the intended use: Scout is built for exploration (long context, efficient inference) while Maverick is built for performance (128 experts, multilingual strength).

What makes Scout notable: 10 million token context window. No other open source model comes close. If you need to process entire codebases, long legal documents, or full research paper collections in a single pass, Scout is the only open source option that handles it natively.

What makes Maverick notable: It matches GPT-4o on MMLU (87.2 vs 87.0) and outperforms it on multilingual benchmarks. The 128-expert design means the model is highly specialized per token, but you need distributed inference or very large VRAM to serve it.

Quick start with Ollama

# Scout (fits on a single GPU with quantization)
ollama pull llama4-scout
ollama run llama4-scout "Summarize the key changes in this file:" < large_file.txt

# Maverick (needs multi-GPU or large VRAM)
ollama pull llama4-maverick
ollama run llama4-maverick "Translate this to Spanish and Japanese:" < input.txt

Quick start with vLLM

pip install "vllm>=0.7.3"
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 131072
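Once the server is up, it exposes an OpenAI-compatible API (vLLM's default is port 8000). A minimal client sketch using only the standard library; the URL and port assume vLLM's defaults:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Assemble an OpenAI-style chat completion payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def query_vllm(payload: dict,
               url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST the payload to a running vLLM server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server from the command above running:
# payload = build_chat_request("meta-llama/Llama-4-Scout-17B-16E-Instruct",
#                              "Summarize the key changes in this file.")
# print(query_vllm(payload))
```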

Qwen 3: The Most Flexible Release

Alibaba shipped Qwen 3 on April 8 with eight model sizes spanning 0.6B to 235B parameters. This is the widest range of any single model family released this month, and the Apache 2.0 license makes it usable in commercial products without restrictions.

The standout feature is hybrid thinking: Qwen 3 models can switch between a chain-of-thought reasoning mode and a direct response mode within the same conversation. You control this with a simple enable_thinking flag in the API call. In thinking mode, the model uses more tokens but produces better results on math, logic, and code generation tasks. In direct mode, it responds faster and uses fewer tokens.

| Size | Best Use Case | Thinking Mode | VRAM (Q4) |
|---|---|---|---|
| 0.6B | Edge devices, IoT | No | ~0.5 GB |
| 4B | Mobile, Raspberry Pi | Yes | ~3 GB |
| 8B | Laptop inference | Yes | ~5 GB |
| 14B | Desktop workstation | Yes | ~9 GB |
| 32B | Single GPU server | Yes | ~20 GB |
| 72B | Multi-GPU server | Yes | ~44 GB |
| 235B MoE | Production serving | Yes | ~32 GB |

Quick start with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
# tokenize=False returns the formatted prompt string; enable_thinking
# toggles Qwen 3's chain-of-thought mode
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
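In thinking mode, Qwen 3 emits its reasoning inside `<think>...</think>` tags ahead of the final answer. A small helper to separate the two (the tag names reflect Qwen 3's documented output format; verify against your build before relying on them):

```python
def split_thinking(raw: str, open_tag: str = "<think>",
                   close_tag: str = "</think>") -> tuple[str, str]:
    """Separate the chain-of-thought block from the final answer.

    Returns (thinking, answer). thinking is "" when the model ran in
    direct mode and emitted no reasoning block.
    """
    start = raw.find(open_tag)
    end = raw.find(close_tag)
    if start == -1 or end == -1:
        return "", raw.strip()
    thinking = raw[start + len(open_tag):end].strip()
    answer = raw[end + len(close_tag):].strip()
    return thinking, answer

demo = "<think>Compare heads, pop the smaller.</think>\ndef merge(a, b): ..."
thinking, answer = split_thinking(demo)
print(answer)  # the code, with the reasoning stripped
```

Logging the thinking text separately is useful for debugging prompts without showing the chain of thought to end users.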

Gemma 3n: Built for Phones and Edge Devices

Google released Gemma 3n on April 9, targeting on-device inference. The "n" stands for nano, and the model uses a per-layer shared embedding architecture that reduces the effective parameter count: a model marketed as 4B effective parameters loads and runs with a far smaller memory footprint than its raw parameter count would suggest.

Gemma 3n is multimodal out of the box: it handles text, images, audio, and video input. For mobile developers building on-device AI features, this is the first time a single model can process all four input types while fitting in under 3 GB of RAM.

# Using LM Studio (desktop)
# Download Gemma-3n-E4B-it from the model browser
# Set context to 8192, runs on CPU with ~3 GB RAM

# Using MLX on Apple Silicon (mlx-lm covers text generation;
# image and audio input need a vision-capable frontend)
pip install mlx-lm
mlx_lm.generate --model google/gemma-3n-E4B-it --prompt "Explain on-device inference in one paragraph"

OLMo 2 32B: Fully Reproducible Research

Ai2 released OLMo 2 32B on April 3. What makes this release different from every other model on this list: Ai2 publishes the training data, training code, intermediate checkpoints, and evaluation code. If you need to audit exactly how a model learned what it knows, or if you want to continue pre-training from a specific checkpoint, OLMo 2 is the only option.

At 32B parameters with a fully dense architecture, it scores 62.9 on MMLU and 45.3 on HumanEval. These numbers do not beat Llama 4 or Qwen 3, but the full reproducibility makes it the model of choice for academic research, interpretability work, and compliance-sensitive deployments.
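Resuming from an intermediate checkpoint works through the standard `revision` argument in Transformers, which selects a branch on the HuggingFace repo. Both the repo id and the branch name below are illustrative placeholders; the real checkpoint branches are listed on Ai2's model page:

```python
def load_olmo_checkpoint(revision: str):
    """Load OLMo 2 32B at a specific training checkpoint.

    `revision` names a branch on the HuggingFace repo; Ai2 publishes
    intermediate checkpoints as separate branches.
    """
    # Deferred import so the sketch can be read/imported without transformers
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "allenai/OLMo-2-0325-32B"  # illustrative repo id -- check Ai2's HF page
    tokenizer = AutoTokenizer.from_pretrained(name, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        name, revision=revision, device_map="auto"
    )
    return tokenizer, model

# Example (branch name is hypothetical, not a real checkpoint id):
# tokenizer, model = load_olmo_checkpoint("stage1-step10000")
```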

Command A: Enterprise RAG Focus

Cohere shipped Command A on April 7 as a 111B MoE model with only 11B active parameters. The model is optimized for retrieval-augmented generation: it performs well when you pass it a set of retrieved documents and ask it to synthesize an answer.

The CC-BY-NC license means you can use it for research and internal tools, but commercial deployment requires a Cohere license. For enterprise teams already using Cohere's API, the open weights let you run the same model on your own infrastructure.
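For the RAG workflow the model is built around, the basic move is assembling retrieved documents plus the user's question into a grounded prompt. A generic sketch of that assembly; the layout below is an illustrative convention, not Command A's official template, so check the model card for the exact format:

```python
def build_rag_prompt(question: str, documents: list[dict]) -> str:
    """Assemble retrieved documents and a question into a grounded prompt.

    Each document is {"title": ..., "text": ...}. The layout is a
    generic convention, not Command A's official chat template.
    """
    parts = ["Answer using only the documents below. Cite document titles.\n"]
    for i, doc in enumerate(documents, start=1):
        parts.append(f"[Document {i}: {doc['title']}]\n{doc['text']}\n")
    parts.append(f"Question: {question}")
    return "\n".join(parts)

docs = [{"title": "Q1 report", "text": "Revenue grew 12% quarter over quarter."}]
print(build_rag_prompt("How fast did revenue grow?", docs))
```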

Phi-4-reasoning: Small Model, Big Reasoning

Microsoft released Phi-4-reasoning on April 10. At 14B parameters with a dense architecture, it punches above its weight on reasoning benchmarks by using a chain-of-thought training approach. It scores 80.6 on AIME 2025 and 75.3 on GPQA Diamond, numbers that beat many models twice its size.

The MIT license makes it one of the most permissively licensed models released this month. If you need strong reasoning capabilities in a model small enough to run on a single consumer GPU, Phi-4-reasoning is worth testing.

# Using Ollama
ollama pull phi4-reasoning
ollama run phi4-reasoning "Prove that the square root of 2 is irrational."

How to Choose: Decision Guide

Picking the right model depends on your constraints. Here is a practical decision tree:

You have one consumer GPU (8 to 24 GB VRAM):

  • Code and reasoning tasks: Qwen 3 8B or 14B
  • On-device / mobile: Gemma 3n 4B
  • Strong reasoning on a budget: Phi-4-reasoning 14B

You have one datacenter GPU (40 to 80 GB VRAM):

  • General purpose: Qwen 3 32B or 72B
  • Long context processing: Llama 4 Scout
  • RAG workloads: Command A

You have multi-GPU or cloud:

  • Maximum quality: Llama 4 Maverick
  • Cost-efficient serving: Qwen 3 235B MoE

You need full reproducibility:

  • OLMo 2 32B (only option with full training data published)
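The branches above condense into a small lookup. An illustrative sketch with model names taken from this post's tables; it encodes the decision guide, not a benchmark, so test candidates on your own workload:

```python
def pick_model(vram_gb: int, need: str = "general") -> str:
    """Map hardware budget and workload to a model from this month's releases.

    `need` is one of: "general", "long_context", "rag", "reasoning",
    "reproducible", "mobile". Encodes the decision guide above.
    """
    if need == "reproducible":
        return "OLMo 2 32B"          # only release with full training data
    if need == "mobile":
        return "Gemma 3n 4B"
    if vram_gb <= 24:                # one consumer GPU
        return {"reasoning": "Phi-4-reasoning 14B"}.get(need, "Qwen 3 8B/14B")
    if vram_gb <= 80:                # one datacenter GPU
        return {"long_context": "Llama 4 Scout",
                "rag": "Command A"}.get(need, "Qwen 3 32B/72B")
    # multi-GPU or cloud
    return "Llama 4 Maverick" if need == "general" else "Qwen 3 235B MoE"

print(pick_model(16, "reasoning"))  # → Phi-4-reasoning 14B
```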

Common Pitfalls When Running New Releases

Warning

New model releases frequently have quantization issues in the first few days. If you see garbled output or repetition loops, check for updated GGUF files before filing a bug. The community typically publishes stable quantizations within 48 to 72 hours of a release.

  • MoE models and VRAM: Even though Llama 4 Scout only activates 17B parameters, you still need to load all 109B parameters into memory. The "active params" number tells you about compute cost, not memory cost. Plan your VRAM based on total parameters, not active parameters.

  • Qwen 3 thinking mode tokens: When enable_thinking=True, the model can use 5 to 10x more tokens for reasoning. If you are paying per token or have tight latency requirements, test with thinking mode off first and only enable it for tasks that genuinely need step-by-step reasoning.

  • License traps: Command A (CC-BY-NC) and Llama 4 (Llama 4 Community License) both have commercial use restrictions. If you are building a product, Apache 2.0 models (Qwen 3, OLMo 2) or MIT models (Phi-4-reasoning) are the safe choices.

  • Context length vs. quality: Llama 4 Scout supports 10M tokens, but quality degrades on tasks beyond ~1M tokens in practice. The 10M number is a technical ceiling, not a practical recommendation. Test with your actual workload.

  • Day-one tooling gaps: Not all inference frameworks support new architectures immediately. vLLM added Llama 4 support in v0.7.3, but some quantization methods (AWQ, GPTQ) lagged by a week. Check your framework's release notes before planning a deployment.

What to Expect for the Rest of April

Based on announcements and previews, we are likely to see additional releases before the month ends. Mistral has teased a new model family, and several Chinese labs (DeepSeek, Yi) have hinted at April releases on social media. We will update this post as new models ship.

Wrapping Up

April 2026 gave open source AI developers more choices than any previous month. The combination of Llama 4's long context, Qwen 3's flexibility across eight sizes, Gemma 3n's on-device capabilities, and competitive smaller models like Phi-4-reasoning means there is now an open source option for nearly every inference scenario. The gap between open source and closed-source models continues to narrow, and for many production use cases, open source is now the practical default.

Fazm helps you automate macOS workflows with AI. If you are building local AI tooling on top of these open source models, Fazm can handle the system integration layer.
