Open Source LLM Releases 2026: Every Major Model So Far
The first quarter of 2026 produced more open-weight large language models than any comparable period in AI history. Meta, Alibaba, Mistral, DeepSeek, Google, and several smaller labs all shipped significant model families between January and April. This post tracks every notable release, compares their capabilities, and explains which model fits which use case.
2026 Open Source LLM Release Timeline
| Model | Organization | Release Date | Parameters | Architecture | License | Context Window |
|---|---|---|---|---|---|---|
| DeepSeek V3-0324 | DeepSeek | Jan 2026 | 685B (37B active) | MoE | MIT | 128K |
| Qwen 2.5-Max | Alibaba | Jan 2026 | Undisclosed | Dense | Qwen License | 128K |
| Gemma 3 | Google | Jan 2026 | 1B / 4B / 12B / 27B | Dense | Gemma License | 128K |
| Mistral Small 3.1 | Mistral | Feb 2026 | 24B | Dense | Apache 2.0 | 128K |
| Command A | Cohere | Mar 2026 | 111B (23B active) | MoE | CC-BY-NC | 256K |
| Llama 4 Scout | Meta | Apr 2026 | 109B (17B active) | MoE | Llama 4 License | 10M |
| Llama 4 Maverick | Meta | Apr 2026 | 400B (17B active) | MoE | Llama 4 License | 1M |
| QwQ-32B | Alibaba | Mar 2026 | 32B | Dense | Apache 2.0 | 128K |
| DeepSeek R2 | DeepSeek | Mar 2026 | 685B (37B active) | MoE | MIT | 128K |
| Mistral Large 3 | Mistral | Apr 2026 | 123B | Dense | Mistral Research | 128K |
Note
"Open source" here includes open-weight models with permissive or semi-permissive licenses. Some models (Llama 4, Command A) have usage restrictions above certain monthly active user thresholds. Always check the specific license before deploying in production.
Llama 4: Scout and Maverick
Meta released two Llama 4 models in early April 2026. Both use a Mixture of Experts (MoE) architecture, which is a departure from the dense transformers used in Llama 2 and Llama 3.
Llama 4 Scout has 109 billion total parameters with 16 experts, routing to one expert per token. Only 17 billion parameters are active during inference, making it runnable on a single H100 GPU. The headline feature is a 10 million token context window, the longest of any open-weight model.
Llama 4 Maverick scales up to 400 billion total parameters (128 experts, also 17B active per token). It targets a 1 million token context window and achieves performance competitive with GPT-4o and Gemini 2.0 Pro on standard benchmarks.
Both models are natively multimodal, handling text and images in a single architecture rather than bolting on separate vision encoders.
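A context window of this size is less about weight memory than about the KV cache, which grows linearly with sequence length. The sketch below estimates KV-cache size from architecture parameters; the layer count, KV-head count, and head dimension used here are purely illustrative assumptions, not Scout's published configuration.

```python
def kv_cache_gb(seq_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, dtype_bytes: int = 2) -> float:
    """Approximate KV-cache size: one key and one value vector per layer,
    per KV head, per token, at dtype_bytes per element (2 = fp16/bf16)."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return seq_len * bytes_per_token / 1e9

# Hypothetical Scout-like config: 48 layers, 8 KV heads, head dim 128, fp16.
# At the full 10M-token window this comes to roughly 1,966 GB of cache alone,
# which is why long-context serving relies on quantized or paged KV caches.
print(kv_cache_gb(10_000_000, 48, 8, 128))
```

The takeaway: whatever the exact architecture, a 10M-token window forces the serving stack to manage the KV cache far more aggressively than the weights.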
Llama 4 Benchmark Highlights
| Benchmark | Llama 4 Scout (109B) | Llama 4 Maverick (400B) | Qwen 2.5-72B | Gemma 3-27B |
|---|---|---|---|---|
| MMLU Pro | 74.3 | 80.5 | 71.1 | 67.8 |
| GPQA Diamond | 57.2 | 69.8 | 49.0 | 42.7 |
| LiveCodeBench | 32.8 | 43.4 | 28.7 | 22.1 |
| MATH-500 | 83.5 | 81.5 | 80.0 | 74.2 |
| Multilingual MGSM | 90.5 | 92.3 | 85.2 | 79.6 |
Running Llama 4 Locally
Scout fits on a single GPU with 80 GB VRAM (H100, A100). For consumer hardware, quantized versions (GGUF Q4) can run on machines with 48+ GB of unified memory (Mac Studio M2 Ultra or similar).
Maverick requires multi-GPU setups or aggressive quantization. Running the full model needs approximately 200 GB of VRAM across multiple cards.
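These figures follow from a simple rule of thumb: weight memory is roughly total parameters times bits per weight, divided by eight. Note that for MoE models the *total* parameter count matters here, since every expert must be resident in memory even though only a few are active per token. A minimal sketch:

```python
def weight_memory_gb(total_params_billions: float, bits_per_weight: int) -> float:
    """Memory for model weights alone; excludes the KV cache, activations,
    and framework overhead, which add a meaningful margin on top."""
    return total_params_billions * bits_per_weight / 8

print(weight_memory_gb(109, 4))   # Scout at 4-bit: 54.5 GB, fits an 80 GB H100
print(weight_memory_gb(400, 4))   # Maverick at 4-bit: 200 GB, multi-GPU territory
print(weight_memory_gb(32, 4))    # QwQ-32B at 4-bit: 16 GB, fits a 24 GB card
```

Budget some extra headroom beyond these numbers for the KV cache, which grows with context length and batch size.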
```bash
# Example: running Llama 4 Scout with vLLM
# Quote the version spec so the shell does not treat ">" as redirection.
pip install "vllm>=0.7.0"
vllm serve meta-llama/Llama-4-Scout-109B-16E-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 65536
```
DeepSeek V3 and R2
DeepSeek continued its aggressive release cadence into 2026. The V3-0324 update in January improved the already strong DeepSeek V3 with better instruction following and coding performance. DeepSeek R2, released in March, is a reasoning-focused model built on the V3 architecture.
Both models use the same 685B total / 37B active MoE architecture. The key difference is post-training: R2 is trained with reinforcement learning on chain-of-thought reasoning traces, similar in spirit to OpenAI's o1 approach but fully open-weight.
DeepSeek R2 stands out for reasoning tasks. On AIME 2025, it scores 79.7% compared to V3's 39.4%. On GPQA Diamond, R2 reaches 72.0% versus V3's 59.1%. For coding and general knowledge, V3-0324 remains competitive and is significantly faster since it does not generate reasoning traces.
Both are released under the MIT license, the most permissive license of any model in this class.
Qwen 2.5-Max and QwQ-32B
Alibaba's Qwen team shipped two significant models. Qwen 2.5-Max, released in January, is a large dense model that competes with GPT-4o on benchmarks like Arena-Hard and LiveBench. Its parameter count has not been officially disclosed but it is estimated to be in the 300B+ range.
QwQ-32B, released in March under Apache 2.0, is a 32 billion parameter reasoning model. Despite its relatively small size, it matches or exceeds much larger models on reasoning benchmarks. On AIME 2025, QwQ-32B scores 79.5%, essentially tying DeepSeek R2 at a fraction of the compute cost.
QwQ-32B has become particularly popular for local deployment because its 32B size fits comfortably on consumer GPUs (24 GB VRAM with 4-bit quantization).
Gemma 3
Google released Gemma 3 in January 2026 with four sizes: 1B, 4B, 12B, and 27B parameters. The Gemma 3 family is notable for its efficiency at smaller scales. The 27B model runs on a single RTX 4090 and outperforms many 70B-class models from 2024 on standard benchmarks.
Key improvements over Gemma 2 include native vision support (image understanding without a separate adapter), a 128K context window across all sizes, and a new sliding-window attention mechanism that reduces memory usage.
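To illustrate the memory effect, here is a minimal sketch of a sliding-window attention mask: each query token attends only to the most recent `window` key tokens instead of the full causal prefix. This is a generic illustration of the technique, not Gemma 3's actual implementation.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query token i may attend to key token j:
    causal (j <= i) and within the last `window` positions."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]

# Each row has at most `window` True entries, so the KV cache for these
# layers only needs to retain the last `window` tokens regardless of
# total sequence length.
mask = sliding_window_mask(6, 3)
```

In practice models often interleave sliding-window layers with a few full-attention layers so long-range information can still propagate.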
The Gemma license allows commercial use with a 1 billion monthly active user restriction, similar to Llama but with a higher threshold.
Mistral Small 3.1 and Mistral Large 3
Mistral released two models in the first quarter:
Mistral Small 3.1 (February, 24B parameters, Apache 2.0) adds vision capabilities and extends the context window to 128K tokens. It targets the "edge deployment" segment where you need a model that runs on a single mid-range GPU. On the OpenLLM Leaderboard, it outperforms Llama 3.3-70B despite being one-third the size.
Mistral Large 3 (April, 123B parameters, Mistral Research License) is the company's flagship open-weight model. It competes directly with Llama 4 Maverick and GPT-4o. The Mistral Research License allows research and evaluation use but requires a commercial agreement for production deployment.
Command A (Cohere)
Cohere released Command A in March 2026. It uses a 111B parameter MoE architecture with 23B active parameters and supports a 256K context window. Command A is optimized for retrieval-augmented generation (RAG) workflows, with strong performance on tasks that require synthesizing information from long documents.
The model is released under CC-BY-NC, which prohibits commercial use. Cohere offers a commercial license separately through its API platform.
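A 256K window changes how much retrieved material you can stuff into a RAG prompt, but you still need a packing step. The hypothetical helper below greedily fills the context up to a token budget; it approximates token counts with whitespace-separated words, which is a crude stand-in for the model's real tokenizer.

```python
def pack_context(question: str, chunks: list[str],
                 budget_tokens: int = 256_000, reserve: int = 4_000) -> str:
    """Greedily pack retrieved chunks into the prompt until the context
    budget is reached, reserving room for instructions and the answer."""
    used = len(question.split()) + reserve
    picked = []
    for chunk in chunks:
        cost = len(chunk.split())  # crude proxy; use the real tokenizer in production
        if used + cost > budget_tokens:
            break
        picked.append(chunk)
        used += cost
    return "\n\n".join(picked + [f"Question: {question}"])
```

Real pipelines typically also rank chunks by retrieval score before packing, so the budget is spent on the most relevant documents first.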
Choosing the Right Model
The "best" open source LLM depends on your deployment constraints and use case:
| Use Case | Recommended Model | Reason |
|---|---|---|
| Reasoning / Math / Science | DeepSeek R2 or QwQ-32B | Top reasoning scores; QwQ-32B if you need a smaller footprint |
| General assistant (server) | Llama 4 Maverick | Broad benchmark leader, multimodal |
| General assistant (single GPU) | Llama 4 Scout | 17B active params, fits on one H100 |
| Local / Consumer hardware | QwQ-32B or Gemma 3-27B | Fit on 24 GB VRAM with quantization |
| Long context (1M+ tokens) | Llama 4 Scout | 10M token context window |
| RAG workloads | Command A | Optimized for retrieval synthesis |
| Permissive license needed | DeepSeek V3/R2 (MIT) | No usage restrictions |
| Edge / Mobile | Gemma 3-1B or Gemma 3-4B | Smallest capable models available |
What to Expect Next
Several models are anticipated for Q2 and Q3 2026:
- Llama 4 Behemoth (Meta): the largest Llama 4 variant, rumored at 2 trillion total parameters. Meta has confirmed the model is in training but has not given a release date.
- Qwen 3 (Alibaba): Alibaba's next-generation model family. Early benchmarks have appeared on evaluation leaderboards but no official release date has been announced.
- Mistral Medium 3 (Mistral): expected to fill the gap between Small 3.1 and Large 3.
- Falcon 3 (TII): the Technology Innovation Institute has announced work on a successor to Falcon 2, targeting competitive performance at the 40B parameter range.
How Fazm Helps You Work with Open Source LLMs
If you run open source models locally or connect to self-hosted inference endpoints, Fazm can automate workflows across your desktop applications using those models as the backend. Fazm's desktop agent connects to any OpenAI-compatible API, so you can point it at a local vLLM, Ollama, or TGI server running any of the models listed above. Your data stays on your machine, and you control the model.