APRIL 2026 / HF + MAC AGENTS / WIRED VIA CUSTOM API ENDPOINT

Best new Hugging Face models for Mac agents, April 2026. Plus the three lines of Swift that wire any of them in.

Most lists for this topic rank Hugging Face models by leaderboard score, then call it a day. A Mac computer-use agent needs a different filter. The model has to expose structured tool calling. It has to serve in a shape the agent harness already speaks. It has to fit in unified memory and run at usable token rates on Apple Silicon. And the license has to let you ship. Four releases this month pass all four. The same field in Fazm's settings, customApiEndpoint, points the harness at any of them.

Matthew Diakonov
11 min read

Direct answer, verified 2026-04-29

Four Hugging Face releases from April 2026 combine working tool calling, MLX or GGUF support on Apple Silicon, and a translation path to the Anthropic chat schema that Fazm and other Claude Code based agents speak.

  1. Google Gemma 4 (E2B / E4B / 26B A4B / 31B, April 2 2026, Apache 2.0).
  2. Alibaba Qwen3.6-27B (April 22 2026, Apache 2.0, native mlx-lm and mlx-vlm support).
  3. Qwen3-Coder-Next (3B activated of 80B total, MLX-community 4-bit on day one).
  4. Multiverse LittleLamb 0.3B Tool-Calling (April 28 2026, Qwen3-0.6B base compressed roughly 50 percent).

All four are wired into Fazm by serving them behind LiteLLM or claude-code-router and pasting the proxy URL into Settings > AI Chat > Custom API Endpoint. The exact line in the repo is Desktop/Sources/Chat/ACPBridge.swift:398. Sources for the model claims: huggingface.co/blog/gemma4, huggingface.co/Qwen/Qwen3.6-27B, huggingface.co/Qwen/Qwen3-Coder-Next.

The shortlist

The four April 2026 releases that pass the agent filter

Each card below names the variant a Mac agent actually wants (the instruction-tuned one with tool calling, not the base), the activated parameter count where mixture-of-experts changes the answer, and the inference engine that has it working today on Apple Silicon.

Google Gemma 4 31B-it

Apache 2.0, dense, 256K context, multimodal. Natively supported in Hugging Face transformers from day one (April 2 2026). At 4-bit it fits comfortably on a 64GB MacBook Pro. Tool calling via the standard Gemma function-calling template, picked up by LiteLLM's Gemma route. The pick if you want a frontier-feel agent on local weights and your Mac can handle it.

Qwen3.6-27B

Apache 2.0, dense, 27B parameters, released April 22 2026. The Qwen team explicitly calls out tool calling as a strength, with vLLM's --tool-call-parser qwen3_coder flag wired in. Native mlx-lm and mlx-vlm builds on Hugging Face on day one. The single best value on the list for a local Mac agent that has to actually work.

Qwen3-Coder-Next

MoE, 3B activated of 80B total. Focused on coding agents and tool-call-heavy workflows. mlx-community has a 4-bit MLX build (Qwen3-Coder-Next-4bit) and lmstudio-community ships an LM Studio MLX variant. The pick if your agent's day job is editing files, running scripts, and chaining shell commands.

Gemma 4 E4B / 26B A4B

Two smaller Gemma 4 variants. E4B has 4.5B effective parameters (8B with embeddings), 128K context, and is multimodal, with audio input included on the small variants. 26B A4B is the mixture-of-experts variant with 4B activated. Both ship Apache 2.0. E4B is the right brain for a 16GB Mac.

Multiverse LittleLamb 0.3B Tool-Calling

Released April 28 2026. Built from Qwen3-0.6B, compressed about 50 percent with Multiverse's CompactifAI. Designed for tool routing, not planning. Pair it with one of the bigger models above as a cheap front-end intent classifier so you save the big model's tokens for hard cases; a sketch of that routing pattern follows below. Bilingual (English and Spanish), with dual inference modes that trade latency against reasoning depth.
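If the routing pattern sounds abstract, here is a minimal sketch of the front-end decision, assuming both models are already served behind Anthropic-compatible proxies on local ports. The function, ports, and thresholds are invented for illustration; nothing here is Fazm's actual routing code, and a real router would classify intent with LittleLamb itself rather than with string heuristics.

import Foundation

// Hypothetical two-tier routing: a 0.3B tool-router handles cheap,
// well-structured requests; anything that needs planning escalates
// to the larger model. Ports and names are placeholders.
enum Brain {
  case littleLamb   // e.g. http://localhost:8767 serving LittleLamb 0.3B
  case workhorse    // e.g. http://localhost:8766 serving Qwen3.6-27B or Gemma 4 31B
}

func pickBrain(for request: String, knownTools: Set<String>) -> Brain {
  // Crude stand-ins for an intent classifier: a known tool name plus
  // no sign of multi-step work means the small model can route it.
  let mentionsKnownTool = knownTools.contains { request.localizedCaseInsensitiveContains($0) }
  let looksMultiStep = request.contains(" then ") || request.count > 400
  return (mentionsKnownTool && !looksMultiStep) ? .littleLamb : .workhorse
}

let tools: Set<String> = ["fill form", "paste", "click"]
print(pickBrain(for: "fill form with the next CSV row", knownTools: tools))                          // littleLamb
print(pickBrain(for: "research the vendor, then draft and send a summary email", knownTools: tools)) // workhorse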

Sizes, dates, and license claims above are taken directly from each model's Hugging Face card. Click any model name in the FAQ for the live URL.

The agent filter

Four properties, in this order

A model that beats a leaderboard but fails any one of these is not a Mac agent brain. The order is intentional: each step rules out more candidates than the last.

Property 1

Structured tool calling the harness can parse

Either Anthropic-style tool_use blocks or a function-calling template the proxy can map onto them. Reasoning-only models that emit free-form chain-of-thought instead of a callable tool name fail here. Most of April's reasoning-only releases get cut at this step.
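Concretely, this is the response shape the harness has to find in the stream. The Swift types below are a sketch: the field names follow the Anthropic Messages API, but the struct names are invented here, and tool inputs are flattened to strings for brevity where the real API allows arbitrary JSON.

import Foundation

// The two content-block shapes a tool-calling round trip turns on.
// Struct names are illustrative; field names follow the Messages API.
struct TextBlock: Decodable {
  let type: String     // "text"
  let text: String
}

struct ToolUseBlock: Decodable {
  let type: String              // "tool_use"
  let id: String                // echoed back in the matching tool_result
  let name: String              // which tool the model wants to run
  let input: [String: String]   // real inputs are arbitrary JSON; simplified here
}

// A model that answers in prose ("first I would open the file...")
// never yields a ToolUseBlock, so the harness has nothing to execute.
let sample = #"{"type":"tool_use","id":"toolu_01","name":"bash","input":{"command":"ls"}}"#
let block = try? JSONDecoder().decode(ToolUseBlock.self, from: Data(sample.utf8))
print(block?.name ?? "no tool call")   // "bash"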

Property 2

A serve path that speaks the agent's schema

Fazm's harness, like Claude Code's, is built against the Anthropic /v1/messages shape. The model itself does not have to speak it. The proxy in front of it does. LiteLLM and claude-code-router both have Anthropic-compatible adapters.

Property 3

Apple Silicon at agent speed

MLX 4-bit or 8-bit on day one is the cleanest path. GGUF via llama.cpp's Metal backend is a fine alternative. Anything that requires CUDA or that can't be quantized below 8-bit on Mac fails here. Fazm sees a token rate, not a backend.

Property 4

A license you can actually ship

Apache 2.0 or MIT for anything you want to deploy at a customer site. Non-commercial research weights are fine for personal use, but they are not what most readers of this page actually need.

Inference engines you actually need

What runs the weights on your Mac

Pick one. They overlap more than people pretend. The ones below all support Apple Silicon natively and have a path to either Anthropic-compatible output (directly or through a one-config proxy) or to a tool that produces it.

MLX (Apple)
mlx-lm
mlx-vlm
llama.cpp + Metal
Ollama
LM Studio
vLLM (M-series)
KTransformers
Hugging Face transformers

On a Mac, MLX is fastest for the small-to-medium models on this list; llama.cpp is fastest for GGUF quants and the most flexible if you need exotic quantization; Ollama is the simplest if you do not want to write a serve config. All three have wrappers that translate to Anthropic-compatible output.

The bridge

Anthropic-compatible proxies that speak both sides

The proxy is the part that turns a Hugging Face model behind any inference engine into something Fazm's harness will accept. You only need one of these.

The anchor fact

Three lines of Swift, two files, one UserDefaults key

The reason this list ends up specific to Fazm is not that the agent is hardcoded against any of these models. It is the opposite. The agent has exactly one place where the choice of brain enters the program, and once you know what that place is, every model on Hugging Face is, in principle, a candidate. Here is the wire.

The settings field, in Desktop/Sources/MainWindow/Pages/SettingsPage.swift, around line 887:

@AppStorage("customApiEndpoint")
private var customApiEndpoint: String = ""

// ... later in the AI Chat settings card ...
TextField("https://your-proxy:8766", text: $customApiEndpoint)

The bridge, in Desktop/Sources/Chat/ACPBridge.swift, at lines 397 to 400:

// Custom API endpoint (allows proxying through Copilot,
// corporate gateways, etc.)
if let customEndpoint = defaults.string(forKey: "customApiEndpoint"),
   !customEndpoint.isEmpty {
  env["ANTHROPIC_BASE_URL"] = customEndpoint
}

That is the whole bridge. The agent process inherits the env var, Claude Code's SDK reads it, and every tool-calling round trip goes to the URL you typed into the field. The URL points at LiteLLM or claude-code-router. The router fans out to whatever inference engine is serving your Hugging Face weights. Fazm never knows which model is on the other end. None of the agent's twenty-something MCP tools, eight prompt overrides, or three timeout tiers are touched by the change.
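To see the handoff outside of Fazm, here is the same pattern reduced to a standalone sketch: read the key, export the variable, launch a child process that inherits it. The binary path and arguments are placeholders, not Fazm's actual launcher, and real agent startup involves far more than this.

import Foundation

// Sketch of the env-var handoff the bridge performs. Placeholder
// binary path and arguments; only the middle three lines mirror
// what ACPBridge.swift does.
var env = ProcessInfo.processInfo.environment

if let customEndpoint = UserDefaults.standard.string(forKey: "customApiEndpoint"),
   !customEndpoint.isEmpty {
  env["ANTHROPIC_BASE_URL"] = customEndpoint   // the one line that picks the brain
}

let agent = Process()
agent.executableURL = URL(fileURLWithPath: "/usr/local/bin/claude")   // placeholder path
agent.arguments = ["--print", "say ok"]                               // placeholder args
agent.environment = env                                               // the child inherits the override

do {
  try agent.run()
  agent.waitUntilExit()
} catch {
  print("failed to launch agent: \(error)")
}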

The wiring, end to end

Four steps from Hugging Face URL to working Mac agent

This is what the loop looks like when you do it by hand. The numbers below are the order a real first-run goes in.

  1. Pull the weights

     ollama pull qwen3.6-27b, or mlx-lm download Qwen/Qwen3.6-27B. Whichever engine you picked.

  2. Run the inference server

     ollama serve on localhost:11434, or mlx-lm.server on a port you pick.

  3. Put the proxy in front

     litellm --model ollama/qwen3.6-27b --api_base http://localhost:11434 --port 8766

  4. Paste the URL into Fazm

     Settings > AI Chat > Custom API Endpoint. http://localhost:8766. Save. Restart chat. If you want to confirm the proxy answers before this step, run the smoke test sketched just after these steps.
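Here is that smoke test as a short script. It assumes the LiteLLM proxy from step 3 is on localhost:8766, that the model alias matches your proxy config, and that no real API key is enforced; the key header is a placeholder, so adjust it if yours is.

import Foundation

// One-request smoke test against the proxy from step 3, before Fazm
// ever sees it. Endpoint, alias, and key are assumptions from the
// steps above; change them to match your setup.
let url = URL(string: "http://localhost:8766/v1/messages")!
var request = URLRequest(url: url)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")
request.setValue("2023-06-01", forHTTPHeaderField: "anthropic-version")
request.setValue("sk-placeholder", forHTTPHeaderField: "x-api-key")   // placeholder key

let body: [String: Any] = [
  "model": "qwen3.6-27b",   // whatever alias your proxy maps to the local weights
  "max_tokens": 64,
  "messages": [["role": "user", "content": "Reply with the word ok."]]
]
request.httpBody = try! JSONSerialization.data(withJSONObject: body)

let done = DispatchSemaphore(value: 0)
URLSession.shared.dataTask(with: request) { data, _, error in
  defer { done.signal() }
  if let error = error { print("proxy unreachable: \(error)"); return }
  // A healthy Anthropic-compatible proxy answers with a JSON body whose
  // top-level "content" array holds text (or tool_use) blocks.
  if let data = data, let text = String(data: data, encoding: .utf8) { print(text) }
}.resume()
done.wait()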


The only Swift in Fazm that knows about your model choice is three lines in ACPBridge.swift. Everything else (the routing table, the tool registry, the screen capture) is model-agnostic on purpose.

Desktop/Sources/Chat/ACPBridge.swift, lines 397 to 400

The numbers that decide which model

Quick sizing for the Mac you actually have

Activated parameters at 4-bit, plus the agent process itself, plus Chrome and whatever else is open. The numbers below are headline-only. Real footprint will vary with context length and KV-cache size, sometimes by a multiple.
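If you want a number instead of a vibe, a back-of-envelope estimate is roughly half a gigabyte per billion parameters at 4-bit, plus KV cache, plus the agent and whatever else is open. The sketch below encodes that arithmetic; the per-token KV figure and the overhead constant are rough assumptions, not measurements, and mixture-of-experts models keep the full expert set resident, so the weights term uses total parameters.

import Foundation

// Back-of-envelope sizing, not a measurement. Constants are rough
// assumptions: ~0.5 GB per billion params at 4-bit, ~0.25 MB of KV
// cache per token of context, ~4 GB for Fazm + bridge + Chrome.
func roughFootprintGB(totalParamsB: Double, contextTokens: Double) -> Double {
  let weightsGB = totalParamsB * 0.5
  let kvCacheGB = contextTokens * 250_000 / 1_000_000_000
  let agentAndAppsGB = 4.0
  return weightsGB + kvCacheGB + agentAndAppsGB
}

print(roughFootprintGB(totalParamsB: 27, contextTokens: 8_000))    // Qwen3.6-27B, short context: ~19.5
print(roughFootprintGB(totalParamsB: 31, contextTokens: 32_000))   // Gemma 4 31B, longer context: ~27.5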

16GB Mac

4.5B

Gemma 4 E4B

Effective parameters at 4-bit. Multimodal text and image. The right brain when memory is tight.

32GB Mac

27B

Qwen3.6-27B

Dense, native MLX 4-bit. Tool calling explicitly supported by the team. The honest best value on this list.

64GB Mac

31B

Gemma 4 31B-it

256K context. Apache 2.0. The closest a local agent gets to a frontier feel today.

If your Mac is the 8GB base M1 or M2: don't use any of the four as your only brain. Use Fazm's built-in Claude account or your personal Claude account for the planning loop, and route LittleLamb 0.3B Tool-Calling through the Custom API Endpoint only for the bulk-data-entry parts of a workflow that are tool-call-heavy and reasoning-light. That is the honest answer for the smallest Macs.

What changes next

Why this list will be wrong in six weeks

Two trends will bend the shortlist. First, the Hugging Face Inference Providers catalog is filling up with one-click serve targets. The proxy step in the wiring above will collapse to a single hosted endpoint for the most popular models, and the Custom API Endpoint field will eventually accept those URLs directly. Second, the Anthropic-compatible adapters that LiteLLM and claude-code-router currently provide are starting to land natively in the inference engines. vLLM has an experimental Anthropic-compatible router in mainline. llama.cpp's server can be wrapped with claude-code-router in two flags.

By June, expect the four-model shortlist to grow to six or seven. The wire path through Fazm's Custom API Endpoint stays the same. That is the part of the architecture that was worth designing once and not rewriting.

Want to try one of these as the brain of your own Mac workflow?

Fifteen minutes. Tell me which Mac you have, what you want the agent to do, and which of the four models is actually the right pick. I will save you a weekend of yak-shaving.

Frequently asked questions

What is the agent filter, and why does it cut most April 2026 Hugging Face releases?

The agent filter is the four properties a Mac computer-use agent actually needs from a model: structured tool calling that the harness can parse, a serving path that speaks the chat schema the agent was built against (in Fazm's case, the Anthropic /v1/messages shape), a quantized variant that fits in unified memory at usable token rates on Apple Silicon, and a license you can ship with. Most April 2026 releases hit one or two. Reasoning-only models without tool-call output skip the first. SaaS-only releases skip the second. Anything above 70B at full precision skips the third. Non-commercial weights skip the fourth. Four releases this month hit all four properties, which is why this list is short.

Why does the model need to speak the Anthropic API shape if Fazm is open source?

Because the agent harness on top of Claude Code (which Fazm builds on) was written against /v1/messages with the Anthropic tool_use and tool_result blocks. That shape is what the bridge serializes and what the streaming parser expects. Hugging Face models don't natively speak it, they speak whatever their template says (often OpenAI chat). The bridge between the two is a translation proxy. LiteLLM has an Anthropic-compatible adapter, so does claude-code-router. You point the proxy at the HF model (via vLLM, llama.cpp, MLX, or Ollama on the back end) and the proxy emits the Anthropic shape on the front end. Fazm sees a normal Anthropic endpoint and never knows the difference.

Where exactly in Fazm do I paste the proxy URL?

Settings, then AI Chat, then the Custom API Endpoint card with the toggle. The placeholder text in the field is https://your-proxy:8766. The field source is in Desktop/Sources/MainWindow/Pages/SettingsPage.swift around line 983 if you want to see the @AppStorage binding. The value is written to UserDefaults under the key customApiEndpoint. On the other side, in Desktop/Sources/Chat/ACPBridge.swift (line 398), the bridge reads that key back and exports it as ANTHROPIC_BASE_URL into the env of the agent process. That env var is what Claude Code's SDK uses to decide where to send tool-calling messages. So once the value is saved, the next chat session goes to your proxy instead of Anthropic, and the proxy decides which model on Hugging Face actually runs.

Which of the four April 2026 models has the best chance of replacing claude-sonnet-4-6 in a Mac agent today?

Honest answer: none of them as a strict drop-in for the full computer-use loop. Qwen3.6-27B is the closest on coding and tool-calling fluency, with Apache 2.0, native MLX 4-bit quants, and an explicit tool_call_parser flag in vLLM and SGLang. Where it falls short relative to a frontier model is multi-step planning under noisy accessibility-tree input and recovery from a mid-task tool error. For a single bounded task, like a CRM data entry workflow you have already taught the agent, Qwen3.6-27B running locally through LiteLLM is genuinely usable and the latency wins back what the planning loses. For an open-ended agent session, you will still feel the gap. Gemma 4 31B is closest behind, with the bonus of fitting comfortably on a 64GB MacBook Pro at 4-bit. LittleLamb 0.3B Tool-Calling is a useful tool-router model in front of a bigger brain, not the brain itself.

What is special about Multiverse Computing's LittleLamb 0.3B Tool-Calling for Mac agents?

It's a 0.3B model, built from Qwen3-0.6B, compressed about fifty percent with Multiverse's CompactifAI technique, and explicitly fine-tuned for tool-calling and agentic workflows. The size is the point. On a Mac, a 0.3B parameter model in 4-bit is under 250MB and runs at hundreds of tokens per second on the Neural Engine via Core ML or on CPU via llama.cpp. That's too small to plan a multi-step computer-use task, but it's an excellent intent classifier and tool router. A common pattern in early-2026 Mac agent stacks: LittleLamb out front to decide which MCP tool to call and with what arguments, a heavier model behind it (Gemma 4 31B, Qwen3.6-27B) only when the task crosses a complexity threshold. You save the big model's tokens for the hard cases.

Do I need a beefy Mac to run any of this?

It depends on which of the four. LittleLamb 0.3B runs on any M-series Mac, including base M1 with 8GB. Gemma 4 E4B (4.5B effective parameters) and Qwen3-Coder-Next at 4-bit (3B activated of 80B total) work on 16GB. Qwen3.6-27B at 4-bit needs 32GB to be comfortable, more if you want to run other apps at the same time. Gemma 4 31B at 4-bit and 26B A4B mixture-of-experts both want 64GB or more. The agent itself, Fazm's Swift process plus the Node bridge plus Chrome, eats roughly 2 to 4GB at idle, so subtract that from the budget. If you only have 16GB, the right call is to use Fazm's built-in Claude account for the heavy planning and route a smaller HF model through the Custom API Endpoint only for the bulk-data-entry parts of the workflow that are tool-call-heavy and reasoning-light.

Will any of this break if Hugging Face changes their model card layout next week?

No, because the agent never reads the model card. The Hugging Face model card layout matters for the human who picks the model. Once the weights are downloaded (via Ollama pull, llama.cpp model download, or huggingface-cli) and served behind LiteLLM or claude-code-router, the agent only sees the proxy. Fazm reads the proxy's response, not Hugging Face's HTML. That's also why the Custom API Endpoint setting was added in the first place: the underlying agent code does not care where the next-token output comes from, as long as the wire shape is Anthropic-compatible. Switching from one HF model to another is one URL change and one proxy restart.

Where does this list go next month?

Two things to watch. The Hugging Face Inference Providers catalog is filling up with one-click serve targets, which means the proxy step will collapse to a single hosted endpoint for the most popular models. And the Anthropic-compatible adapters are starting to land natively in the inference engines themselves: vLLM has an experimental Anthropic-compatible router, llama.cpp's server can be wrapped with claude-code-router in two flags. By May 2026, expect the four-model shortlist here to grow to six or seven, but the wire path through Fazm's Custom API Endpoint stays the same.

Is Fazm tied to any of these specific Hugging Face models?

No. Fazm has no model preference baked in. The default Claude Account picker in Settings has two options, Built-in (Fazm's hosted Claude) and Personal (your own Claude account via OAuth). The Custom API Endpoint is a separate toggle below them and overrides whichever account is selected. The repo at github.com/mediar-ai/fazm is MIT-licensed, the bridge is auditable, and the only string Fazm cares about is whatever is in the customApiEndpoint UserDefaults key. Plug in any HF model that can be served as Anthropic-compatible, and the agent runs on it.