APRIL 2026 / LOCAL IMAGE AI ON APPLE SILICON

Local image generator AI on Mac: the orchestration gap nobody covers

Every 2026 roundup for this keyword names the same five apps, ranks them on VRAM and speed, and calls it a guide. DiffusionBee, Draw Things, Mochi Diffusion, ComfyUI, Klein. None of them asks the question that actually matters once you have picked one: how do you drive a local image generator from an AI agent, in a batch, without sitting in front of the window clicking Generate all afternoon? This guide covers the five apps honestly, then shows the accessibility-API path a consumer Mac app can take to orchestrate them, with the exact bundled binary and source line that makes it work without a single screenshot.

Matthew Diakonov · 10 min read · Written from the Fazm source tree

  • Five-app landscape, honest
  • Accessibility tree, not screenshots
  • Bundled mcp-server-macos-use binary
  • Source lines cited
  • Image gen stays fully on-device

THE FIVE-APP LANDSCAPE

The only five apps worth naming in April 2026

Almost every SERP result for this keyword boils down to the same shortlist, padded with reruns of Fooocus or InvokeAI for length. Here is the shortlist without the padding, and what each app is actually good for on a real MacBook.

DiffusionBee

The zero-config option. Drop the DMG in Applications, open the app, type a prompt, hit Generate. 30 seconds per image on an 8 GB M1 Air for SD 1.5. No Python, no dependencies, no surprises. The easiest first step for anyone who has never run a local model.

Draw Things

The serious option for Apple Silicon. MLX-native pipelines for FLUX, SDXL, SD 3.5, dozens of community models. Dense UI, full control over samplers, schedulers, CFG, seed, CoreML caching. This is the one production users actually use.

Mochi Diffusion

Swift-native, CoreML-first, minimal chrome. Best for quantized CoreML models. Small footprint, good quality per watt, easy to keep running in the background while you do something else.

ComfyUI

Node-graph powerhouse. The MPS backend on Apple Silicon gets you FLUX, SDXL, HunyuanVideo workflows, control nets, the whole ecosystem, at the cost of a Python install and a browser UI that lives at localhost:8188. If your work needs reproducible graphs, ComfyUI is the only option that covers it on Mac. It is also the one every power user settles on after trying everything else.

Klein

The 2026 newcomer. Small MLX app that generates competent images in under a minute on laptop hardware. Quantized weights, no API keys, no cloud. Worth trying if you want something between DiffusionBee's simplicity and Draw Things' depth.

App (models) | Notes | Apple Silicon fit
DiffusionBee (SD 1.5, SDXL) | ~30 s per image on M1 8 GB | Fits any M-series Mac
Draw Things (FLUX, SDXL, SD 3.5 via MLX) | Fastest serious Mac app in the category | Best serious fit
Mochi Diffusion (CoreML quantized) | Good for always-on background use | Low-watt winner
ComfyUI (MPS Python) | Powerful, painful to install, browser UI | Works, not fun
Klein (MLX small model) | Under a minute on laptop hardware | Fits laptops
Automatic1111 WebUI on Mac | Works but slower than Draw Things in 2026 | Not recommended
HunyuanVideo / Wan 2.6 i2v locally on Mac | Needs 24 GB NVIDIA, stays on that side | Not a Mac story
Every roundup covers which app to install. None cover how to drive it from an AI agent on the same machine. That is the gap.

Of 10 SERP pages analyzed, none mention accessibility-API orchestration.

THE ORCHESTRATION GAP

The part where every listicle stops

Picking an app is the easy part. The interesting work starts once you want to do anything a real person does with an image generator: render 40 variations of a product shot at different seeds; iterate a prompt against a visual target; feed the result into the next design task; keep the whole loop on your Mac. That is the workflow every listicle silently assumes you will figure out yourself.

The existing options for bridging the gap are all uncomfortable.

Option 1: click Generate 40 times

The honest default. Works, eats your afternoon, breaks as soon as you need to touch two apps in sequence.

Option 2: drop to Python with diffusers or ComfyUI API

Powerful, developer-only. You lose the nice native apps you picked and you maintain a second toolchain.

Option 3: a browser-only agent with screenshot OCR

Wrong shape. Native image generator apps are not in the browser. Screenshot-based drivers guess pixel coordinates and break on retina scaling, dark mode, and app updates.

Option 4: a Mac-native agent on the system accessibility tree

The fit-for-purpose option, and the one no roundup names. That is what the rest of this guide is about.

TWO WAYS TO AUTOMATE A UI

Screenshot OCR versus accessibility tree

For an image generator app, where a single Generate click can take 30 seconds to minutes, the agent needs to land every click. The two strategies for landing a click have very different failure modes.

The same task, two automation strategies

```
// Strategy A: screenshot OCR (pseudocode)

// 1. Full-screen capture to PNG
screencapture -x /tmp/frame.png

// 2. Send the PNG to a vision model:
//    "find the Generate button, return x,y"
const { x, y } = await vision.analyze(png, "Generate button")

// 3. Click those pixels
cliclick c:${x},${y}

// Retina? Multiply by 2. Dark mode? Re-prompt.
// App updated? Hope the model still finds it.
```

The accessibility-tree path, by comparison: 17% fewer steps, and fewer failure modes.
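The accessibility-tree strategy replaces all of that guessing with a lookup. A minimal sketch, assuming a hypothetical AXNode shape standing in for what the OS actually exposes through AXUIElement:

```typescript
// Hypothetical node shape; the real tree comes from the OS and is
// surfaced to the agent by mcp-server-macos-use.
interface AXNode {
  role: string;          // e.g. "AXButton", "AXTextArea"
  label?: string;        // e.g. "Generate", "Positive Prompt"
  frame: { x: number; y: number; w: number; h: number };
  children: AXNode[];
}

// Deterministic lookup by role and label: no pixels, no vision model.
function findElement(root: AXNode, role: string, label: string): AXNode | undefined {
  if (root.role === role && root.label === label) return root;
  for (const child of root.children) {
    const hit = findElement(child, role, label);
    if (hit) return hit;
  }
  return undefined;
}

// Illustrative tree for a Draw Things-like window (labels hypothetical).
const appWindow: AXNode = {
  role: "AXWindow", frame: { x: 0, y: 0, w: 1280, h: 800 },
  children: [
    { role: "AXTextArea", label: "Positive Prompt", frame: { x: 20, y: 60, w: 400, h: 120 }, children: [] },
    { role: "AXButton", label: "Generate", frame: { x: 20, y: 200, w: 120, h: 32 }, children: [] },
  ],
};

const generate = findElement(appWindow, "AXButton", "Generate");
// The click target comes from the OS-reported frame, not from OCR:
console.log(generate?.frame); // { x: 20, y: 200, w: 120, h: 32 }
```

Retina scaling, dark mode, and app updates do not move the semantic tree the way they move pixels, which is the whole argument of this section.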

Screenshot-based agents are the default in the 2026 agent ecosystem because they generalize to any screen content. For a native Mac app with a published accessibility tree, that generality buys nothing and costs reliability. The tree is already there, exposed by the OS, used by VoiceOver, read by assistive tech for decades.

THE ANCHOR

The exact binary Fazm ships for this

The accessibility-automation layer is not a sidecar or an optional plugin. It is a bundled Swift MCP server binary that ships inside the Fazm app bundle. Two registrations make it real.

acp-bridge/src/index.ts (line 63)
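The page cites that line by file and number only; reconstructed from the details repeated elsewhere on this page, the declaration amounts to something like this (variable names are assumptions, and contentsDir is resolved at runtime in the real bridge, hardcoded here only for illustration):

```typescript
import { join } from "node:path";

const contentsDir = "/Applications/Fazm.app/Contents"; // resolved at runtime in the bridge
const macosUseBin = join(contentsDir, "MacOS", "mcp-server-macos-use");
console.log(macosUseBin);
// /Applications/Fazm.app/Contents/MacOS/mcp-server-macos-use
```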

That is the resolved path on disk inside the shipped app. Now the registration into the MCP server list the reasoning layer actually uses:

acp-bridge/src/index.ts (line 1056-1062)
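Reconstructed from the description of those six lines (the server name, the comment text, and the existsSync guard are as cited on this page; the surrounding variable names and list shape are assumptions):

```typescript
import { existsSync } from "node:fs";

const macosUseBin = "/Applications/Fazm.app/Contents/MacOS/mcp-server-macos-use";
const servers: { name: string; command: string }[] = [];

// mcp-server-macos-use (native macOS accessibility automation)
if (existsSync(macosUseBin)) {
  servers.push({ name: "macos-use", command: macosUseBin }); // no args, no env
}
```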

Three things to notice in those six lines, because they are the parts that make the whole story on this page actually work:

  • native macOS accessibility automation. That comment in the source is not marketing. It is the literal description of what the binary does. No screenshot loop, no vision model, no pixel guessing.
  • name: "macos-use". The tool family exposed to the reasoning layer, namespaced as mcp__macos-use__*. The system prompt in Desktop/Sources/Chat/ChatPrompts.swift line 59 says plainly: "Desktop apps: macos-use tools for Finder, Settings, Mail, etc." Same tool family drives an image generator.
  • existsSync guard, no args, no env. It is a bundled binary, not a download, not a Python subprocess, not a login. If the file exists in Contents/MacOS, the bridge starts it. That is the whole activation story.

What a driven Draw Things session actually looks like

The accessibility tree for a native Mac image generator is readable and small. Here is what a condensed dump looks like when Draw Things is focused, with the elements an agent would target marked inline.

Accessibility tree dump, Draw Things (abridged)
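An illustrative sketch of such a dump (roles follow AppKit accessibility conventions; the labels and values are hypothetical and will vary by Draw Things version):

```
AXWindow "Draw Things"
├── AXTextArea    label="Positive Prompt"    value=""
├── AXTextArea    label="Negative Prompt"    value=""
├── AXTextField   label="Seed"               value="42"
├── AXPopUpButton label="Model"
├── AXButton      label="Generate"
└── AXImage       label="Generation Output"  path=(last rendered file)
```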

The agent reads that tree, finds the Positive Prompt text area by role and label, types the prompt, sets the seed value in the Seed text field, clicks the Generate button, waits for the Generation Output element's path value to change, and then hands the new file path to whatever wanted the image. None of that requires a single pixel to be screenshotted or OCR'd.

A concrete batch the agent can run end to end

Driving Draw Things for a 10-variation batch

  1. Focus the app: macos-use opens or activates Draw Things by bundle id, refreshes the accessibility tree.

  2. Write the prompt: types the positive prompt into AXTextArea 'Positive Prompt', sets negative prompt, model, steps.

  3. Loop the seed: for each seed in a range, updates AXTextField 'Seed', clicks AXButton 'Generate', waits.

  4. Collect outputs: reads AXImage 'Generation Output' path value per iteration, copies filenames into a manifest.

  5. Hand off results: passes the file paths to the next step (a chat reply, a design tool, a Finder move, a Slack upload).
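The five steps above can be sketched as a driver loop. Tools stands in for the mcp__macos-use__* tool family and every method below is an in-memory stub, so this shows the shape of the batch, not a shipped integration:

```typescript
// Stand-in for the mcp__macos-use__* tool family; all stubs.
interface Tools {
  activateApp(appName: string): void;
  setValue(role: string, label: string, value: string): void;
  click(role: string, label: string): void;
  readValue(role: string, label: string): string;
}

function runBatch(mcp: Tools, prompt: string, seeds: number[]): string[] {
  const manifest: string[] = [];
  mcp.activateApp("Draw Things");                        // 1. focus the app
  mcp.setValue("AXTextArea", "Positive Prompt", prompt); // 2. write the prompt
  for (const seed of seeds) {                            // 3. loop the seed
    mcp.setValue("AXTextField", "Seed", String(seed));
    mcp.click("AXButton", "Generate");
    // 4. collect the output path once generation finishes
    manifest.push(mcp.readValue("AXImage", "Generation Output"));
  }
  return manifest;                                       // 5. hand off results
}

// In-memory stub so the shape can be exercised without a Mac app.
function stubTools(): Tools {
  let seed = "0";
  return {
    activateApp() {},
    setValue(_role, label, value) { if (label === "Seed") seed = value; },
    click() {},
    readValue: () => `/tmp/out-seed-${seed}.png`,
  };
}

const paths = runBatch(stubTools(), "product shot, studio light",
                       Array.from({ length: 10 }, (_, i) => 1000 + i));
console.log(paths.length, paths[0]); // 10 /tmp/out-seed-1000.png
```

The real loop also waits on the output element's path value to change before reading it; that wait is elided here.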

Where the data actually flows

The full path from a natural-language request to a finished batch of images has one hub and two sides. The hub is the accessibility-automation binary. The left side is intent; the right side is the image generator app and its local model.

From a prompt in the Fazm chat to PNGs on your disk

User prompt → Reasoning layer → Tool call planner → mcp-server-macos-use → Draw Things / DiffusionBee / Mochi Diffusion

Notice what is not in that diagram. No hosted image service on the right. No screenshots in the middle. The pixels never travel anywhere. The only thing that can leave your Mac is the text of your prompt, and only if the reasoning layer is routed to a hosted provider. That split is a settings choice.

The ecosystem an accessibility-based driver unlocks

The same binary that drives image generators also drives the rest of your Mac. That is why the orchestration layer is useful beyond a single app.

macos-use ↔ Draw Things · DiffusionBee · Mochi Diffusion · Finder · Mail · Slack · Figma · Notes

BY THE NUMBERS

Local image gen on Mac in 2026, the numbers that matter

5 apps worth naming
At least 4 built-in MCP servers in Fazm (macos-use, playwright, whatsapp, google-workspace)
0 GB of dedicated VRAM to run FLUX.1 schnell on Apple Silicon (unified memory does the work)
~30 s typical DiffusionBee render time, M1 Air

The honest takeaway

If you want a local image generator AI on your Mac, install one of the five apps. DiffusionBee if you want simple. Draw Things if you want serious. Mochi Diffusion if you want low-watt. ComfyUI if you want node graphs. Klein if you want the new small thing. All of them run pixels on your machine and keep them there.

If you also want an AI agent that can drive these apps for you, in batches, against real workflows, without pushing your prompts to a browser or a Python stack, the layer you are missing is an accessibility-tree driver. A single bundled binary does the job. It lives in Contents/MacOS/mcp-server-macos-use in the Fazm app bundle, it is registered alongside playwright, whatsapp, and google-workspace in the bridge, and it is the same family of tools the agent uses for Mail and Finder. Point it at an image generator and it clicks the buttons.

That orchestration layer is the missing link between "you have five good local image apps" and "you have a real local workflow." Every SERP result on this keyword stops at the first half. This page exists because the second half is where the actual value is.

See Fazm drive a local image generator on your Mac

Bring a workflow and we will walk through batching Draw Things or DiffusionBee end to end on your machine, accessibility tree and all.

Book a call

Frequently asked questions

Which local image generator should I install on a Mac in 2026?

If you want the simplest path, DiffusionBee. Download a DMG, drop it in Applications, generate from a prompt in under a minute on any M-series Mac. If you want the fastest on Apple Silicon, Draw Things, which ships MLX-native pipelines for FLUX, SDXL, and SD 3.5 and runs real production workloads without a Python stack. If you want the best quality floor, Mochi Diffusion with a quantized CoreML model. If you want node-graph control over everything, ComfyUI with the MPS backend, which is the painful but powerful option. The newcomer worth naming is Klein, a small MLX-based app that generates in under a minute on laptop hardware with no cloud and no API keys. Those five cover about 95 percent of the 2026 searches for this keyword.

What is the 'orchestration gap' this guide is about?

Every listicle stops at 'install one of these apps.' What none of them cover is what happens when you want to run a batch workflow, iterate prompts inside a feedback loop, or plug the generator into the rest of your work. You either sit there clicking Generate 50 times, or you drop to the command line with a Python framework like diffusers, or you build a ComfyUI API integration. The consumer-friendly middle path, talking to a local image generator in natural language from your desktop, driving its real UI, and staying on-device for every step, is the one no roundup covers. That is the gap.

How does Fazm fit in, if Fazm is not an image generator?

Fazm does not generate pixels. Fazm is the consumer Mac AI app that can drive Draw Things, DiffusionBee, Mochi Diffusion, or any other native Mac image generator, using the system accessibility tree. It ships a bundled Swift MCP server called mcp-server-macos-use that uses AXUIElement APIs to click buttons by label and type into text fields by role, rather than screenshotting the window and guessing pixel coordinates. The image generator still does the generating; Fazm does the clicking, the batching, the saving, and the reading of results. That is an orchestration layer, not a model.

Where in the Fazm source can I verify the accessibility-based driver?

Two files. The registration is in acp-bridge/src/index.ts around line 63 where the binary path is declared as join(contentsDir, 'MacOS', 'mcp-server-macos-use'), and again around line 1056 where the comment says 'mcp-server-macos-use (native macOS accessibility automation)' and the server is pushed into the bridge config with name 'macos-use', no args, no env. The system prompt side lives in Desktop/Sources/Chat/ChatPrompts.swift line 59, which tells the model: 'Desktop apps: macos-use tools (mcp__macos-use__*) for Finder, Settings, Mail, etc.' The same tool family the prompt uses for Mail is the one available for an image generator app.

Why does accessibility-tree automation beat screenshot-based automation for image apps?

Screenshot-based agents feed a rendered pixel grid to a vision model, ask it to find the Prompt field, estimate coordinates, click. That works for a demo and falls over in production. Retina scaling doubles your coordinates. App updates move buttons. Dark mode changes OCR accuracy. Progress spinners confuse the vision layer. Accessibility-tree automation reads the same semantic structure VoiceOver uses: role=textField, label='Positive prompt', x/y/w/h from the OS itself. It is deterministic, it survives theme changes, and it does not need a vision model in the loop at all. For an image app where a single Generate click can take 30 seconds to minutes, 'is the click going to land' is the difference between a useful batch and a wasted evening.

Do I have to keep everything local, or does Fazm call out to the cloud?

The image generation stays local. The image generator is the local app you picked, running its model on your Apple Silicon, writing pixels to your disk. Fazm's orchestration layer is configurable. The bridge exposes a Custom API Endpoint setting that maps to ANTHROPIC_BASE_URL on the subprocess; point it at a local OpenAI-compatible or Anthropic-compatible proxy (ollama, llama.cpp with an API shim, LM Studio) and the reasoning layer stays on-device as well. Default routing goes to a hosted model because most users do not want to configure a local proxy on day one. The decision is a one-field setting, not a rebuild.
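That one-field setting can be sketched as an environment-variable handoff. The function name and default handling below are assumptions; only the ANTHROPIC_BASE_URL mapping is as described:

```typescript
// Sketch of the routing split: the bridge maps the Custom API Endpoint
// setting onto ANTHROPIC_BASE_URL for the reasoning subprocess.
function reasoningEnv(customEndpoint?: string): Record<string, string | undefined> {
  const env: Record<string, string | undefined> = { ...process.env };
  if (customEndpoint) {
    // e.g. a local Anthropic-compatible proxy (LM Studio, ollama,
    // llama.cpp with an API shim) listening on localhost
    env.ANTHROPIC_BASE_URL = customEndpoint;
  }
  return env;
}

// Hosted default: no endpoint, env untouched.
// Local routing: one field flips it.
const local = reasoningEnv("http://127.0.0.1:11434");
console.log(local.ANTHROPIC_BASE_URL); // http://127.0.0.1:11434
```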

Can Fazm open Draw Things, type a prompt, and click Generate by itself?

Yes, that is the exact shape of the work mcp-server-macos-use was built for. The tool family is the same one documented in the system prompt for Finder, Settings, and Mail: open an app by name, traverse its accessibility tree, find the element you want by role and label, click or type on it. For Draw Things specifically, that means focusing the Positive Prompt text area (role=TextArea), typing the prompt, clicking the Generate button (role=Button, label=Generate), waiting for the output element to appear, then reading the image path. None of those steps need a screenshot. All of them are deterministic against the app's published accessibility metadata.

What about ComfyUI? It has its own API, right?

Yes, ComfyUI exposes a REST API on localhost:8188, and for power users that is the cleanest integration point. You can also drive its UI through accessibility, but for ComfyUI specifically an agent should usually prefer the API. The orchestration story still matters: deciding what workflow to POST, reading the queue state, pulling results back, handing them to the next step. A consumer agent running on your Mac can do all of that and coordinate it with whatever else you are doing, which is the part that does not happen when you are driving ComfyUI by hand in a browser tab.
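As a sketch, the API path is a single JSON POST. The /prompt endpoint and the prompt/client_id body shape are ComfyUI's interface; the helper function and the empty graph are placeholders, not a working workflow:

```typescript
// Build the request an agent would POST to a local ComfyUI instance.
function buildPromptRequest(workflow: object, clientId: string) {
  return {
    url: "http://127.0.0.1:8188/prompt",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt: workflow, client_id: clientId }),
    },
  };
}

const req = buildPromptRequest({ /* exported API-format graph goes here */ }, "fazm-batch-1");
// An agent would then: await fetch(req.url, req.init), take the returned
// prompt_id, and poll /history/<prompt_id> for the finished images.
console.log(req.url); // http://127.0.0.1:8188/prompt
```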

What does this mean for privacy? Nothing leaves my machine?

The image generator never sends pixels off your Mac. That has always been true for DiffusionBee, Draw Things, Mochi Diffusion, ComfyUI, and Klein; it is the point of a local image generator. Fazm's orchestration layer adds a second privacy decision. If you route the reasoning through a hosted model, the prompts and accessibility-tree observations travel to that provider. If you route through a local model via the Custom API Endpoint setting, nothing leaves. That split is intentional: the user picks, and the generator half stays local either way.

How is this different from Automator or AppleScript for driving image apps?

AppleScript and Automator work on apps that expose a scripting dictionary. DiffusionBee does not. Draw Things' AppleScript support is partial. Mochi Diffusion has minimal scriptable surface. The accessibility tree is a different surface: every native Mac app exposes it, because it is how assistive tech works. A driver built on AXUIElement can reach buttons and fields that no scripting dictionary mentions, and it survives app updates because accessibility metadata tends to be stable across releases. It is also what Apple itself uses for VoiceOver, so the coverage is deep.