Local AI video generator: two jobs, one noisy keyword
The top 2026 SERP results all answer one half of the question. HunyuanVideo, Wan 2.2, LTX Video, ComfyUI, a 24 GB NVIDIA card. That is the synthesis half. The other half, recording and reasoning over real screen video locally, is where a different stack lives, and where Fazm lives. This guide covers both, then walks down into the ffmpeg flags and Swift file paths that make the Mac-local half actually work.
THE CATEGORY SPLIT
Two jobs that share one keyword
The phrase "local AI video generator" used to mean one thing in 2024: Stable Video Diffusion on a big GPU. In 2026 it covers two distinct stacks. Most articles pick one and pretend the other does not exist. A clean split is the first thing the category needs.
Job A: synthesize pixels from a prompt
Diffusion and transformer models turning a text or image prompt into new video frames. HunyuanVideo (13B params), Wan 2.1, Wan 2.2, Wan 2.6, LTX Video, CogVideoX. Interface: ComfyUI almost universally. Hardware floor in 2026: 24 GB NVIDIA VRAM for the good ones. Operating system: Linux or Windows with CUDA. Apple Silicon support: partial, painful, and not the happy path.
Job B: record and reason over real video
Capture the screen locally, encode to a playable format, feed it into an AI that can actually understand what is happening and act. Hardware: any reasonably recent Mac. Interface: a consumer app, not ComfyUI. Fazm is the one this guide walks through in detail.
They do not substitute for each other
A synthesis model cannot watch your Notion window. An on-device recorder cannot hallucinate a new scene. Picking the wrong stack for the job is why people give up on the whole category.
What this guide does
Covers the synthesis names briefly so you know the landscape, then spends most of its oxygen on the capture-and-reason half, which almost no SERP result does. That half includes the exact ffmpeg flags and Swift file paths Fazm uses, verifiable from the MIT-licensed source tree.
JOB A, BRIEFLY
The synthesis shortlist, with the hardware truth attached
If your goal is to generate new video from a prompt on your own hardware, four open models dominate the 2026 coverage. All four assume a serious NVIDIA card and a ComfyUI workflow. None of them has a first-class Apple Silicon path.
HunyuanVideo (Tencent, 13B)
The quality leader on open weights. Cinematic motion, good temporal coherence. Practical floor: 24 GB VRAM, 3090 class or newer. Driven through ComfyUI with a node graph. Not usable on a MacBook without gymnastics.
Wan 2.1 / 2.2 / 2.6 (Alibaba)
The other 'good locally' family. Free, open weights, competitive with closed tools on short clips. Quoted floor in 2026 is still 24 GB VRAM, and fp16 wants more. RTX 5090 class is the comfortable fit.
LTX Video (Lightricks)
The speed-first option. Smaller memory footprint, fits on more cards, quality trails HunyuanVideo and Wan for final output. Good for iteration, concept testing, rough-cut previews.
CogVideoX, AnimateDiff, Stable Video Diffusion
Still-covered legacy choices. Useful for specific jobs (style-consistent shorts, img2vid), mostly displaced by the above for net-new work. Same ComfyUI, same NVIDIA floor.
If you do not already own an RTX 4090 or 5090 class card, the honest move is to use a hosted inference API for synthesis and stop trying to fit a 13B text-to-video model into laptop VRAM. The local synthesis story is real; it is just not a laptop story in April 2026.
“There is a second meaning of 'local AI video' that no roundup covers: an AI that records your real screen locally and reasons about it. Different stack, different hardware, different answer.”
Fazm screen observer, active-window capture
JOB B, IN DEPTH
The Mac-local capture pipeline, from pixels to chunks
The recording side of local AI video is the half that actually benefits from being on your own machine. Private work, internal apps, sensitive data, anything you would not paste into Descript or CloudApp, all of it survives intact only if the encoder runs locally. Here is how Fazm does it, with the file paths open for inspection.
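The configuration that startScreenObserver() builds can be sketched like this. Type and property names are illustrative stand-ins, not a verbatim copy of the Swift source, but the values (2 FPS, 60-second chunks, active-window capture, the cache path, hardware-UUID identity) are the ones quoted from the source tree:

```swift
import Foundation

// Sketch of the SessionRecorder configuration that startScreenObserver()
// builds in Desktop/Sources/SessionRecordingManager.swift. Type and
// property names here are illustrative; the values are from the source.
enum CaptureMode {
    case activeWindow   // only the frontmost window is encoded
    case fullScreen
}

struct RecorderConfig {
    var framesPerSecond: Double = 2.0        // enough for UI reasoning, cheap on battery
    var chunkDurationSeconds: Double = 60.0  // one playable chunk per minute
    var captureMode: CaptureMode = .activeWindow
    // Chunks land in a plain, user-inspectable cache directory.
    var storageURL: URL = FileManager.default.homeDirectoryForCurrentUser
        .appendingPathComponent("Library/Caches/observer-recordings")
    // Identity is the hardware UUID from IOPlatformExpertDevice
    // (getHardwareUUID() in the source), not a login.
    let deviceId: String
    // Note what is absent: no backendURL, no backendSecret.
    // The source comment calls this "local-only mode".
}
```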
Four things to notice in that configuration, because they are the differences from every other tool that markets itself under this keyword:
- framesPerSecond: 2.0. A sustained 2 FPS is enough for an AI to follow intent on a desktop UI, and cheap enough to run for hours without draining the battery. Most synthesis tools render at 24 or 30 FPS; a local observer does not need that.
- captureMode: .activeWindow. Only the frontmost window gets encoded. No multi-monitor bloat, no capturing your second screen while you answer a Slack DM on your third. The reasoning layer only ever sees the app you are working in.
- No backendURL, no backendSecret. The source code comment is literal: "local-only mode". The recorder will not emit HTTP traffic. The chunks live in ~/Library/Caches/observer-recordings/, your machine, your disk.
- deviceId = hardware UUID. Identity comes from IOPlatformExpertDevice through getHardwareUUID(), not a login. You do not sign in to start recording; the device key exists before you have an account.
THE ANCHOR
The exact ffmpeg invocation, copied from the source
This is the part nobody else in the "local AI video generator" SERP shows, because nobody else in that SERP is actually doing it on a Mac. It lives in the bundled macos-session-replay package at Sources/SessionReplay/VideoChunkEncoder.swift around line 232. Every flag is doing a job.
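Reconstructed from the flag set quoted later in this guide, here is roughly what that invocation looks like as the argument array a Swift Process would hand the bundled ffmpeg. The input and output handling is an assumption; the output flags are the ones the source quotes:

```swift
import Foundation

// The flag set quoted from Sources/SessionReplay/VideoChunkEncoder.swift
// (around line 232), laid out as the argument array a Process would pass
// to the bundled ffmpeg. Input/output handling here is an assumption.
func encoderArguments(input: String, output: String) -> [String] {
    [
        "-i", input,                       // captured frames in (assumed plumbing)
        "-vcodec", "hevc_videotoolbox",    // Apple's hardware HEVC encoder
        "-tag:v", "hvc1",                  // QuickTime-friendly HEVC tag
        "-q:v", "65",                      // VideoToolbox constant-quality setting
        "-allow_sw", "true",               // permit software fallback if needed
        "-realtime", "true",               // favor latency over compression
        "-prio_speed", "true",
        "-movflags", "frag_keyframe+empty_moov+default_base_moof", // playable mid-write
        output,                            // e.g. one 60-second chunk file (assumed)
    ]
}
```

A Process would then get these via `process.arguments = encoderArguments(input:output:)`.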
The three flags that make this a Mac story, not a generic story:
-vcodec hevc_videotoolbox
Routes encoding through Apple's hardware HEVC encoder. On Apple Silicon, that lives on the dedicated media engine next to the CPU and GPU, a separate block with its own power budget. A software encoder like libx264 or libx265 would pin a CPU core, spin up the fans, and compete with whatever app you are recording. This flag is the difference between "always-on background observer" and "hot laptop".
-tag:v hvc1 -movflags frag_keyframe+empty_moov+default_base_moof
Tags the stream as HVC1 (the QuickTime-friendly variant of HEVC) and writes a fragmented MP4 structure. Each 60-second chunk is playable the moment it is flushed, with no finalization step required. That is what lets the analysis loop start consuming a chunk while the next one is still being recorded.
-realtime true -prio_speed true -q:v 65
Tells the encoder to prioritize latency over compression ratio, and caps quality at 65 on VideoToolbox's constant-quality scale (1 to 100, where higher means better quality). Good enough for an AI to read UI text and track layout, small enough not to turn your cache directory into a landing strip for gigabytes of video.
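One consequence of the fragmented layout is that a consumer can pick chunks up mid-session. A minimal sketch of such a consumer loop, purely illustrative (this is the concept, not Fazm's actual analysis loop):

```swift
import Foundation

// Illustration of why fragmented MP4 matters for the analysis loop: each
// flushed chunk is already playable, so a consumer can hand it off while
// the next chunk is still being written. Polling and the handler are
// illustrative, not Fazm's code.
func consumeChunks(in dir: URL, seen: inout Set<String>, analyze: (URL) -> Void) {
    let fm = FileManager.default
    let files = (try? fm.contentsOfDirectory(at: dir, includingPropertiesForKeys: nil)) ?? []
    for chunk in files where chunk.pathExtension == "mp4" && !seen.contains(chunk.lastPathComponent) {
        seen.insert(chunk.lastPathComponent)
        analyze(chunk)   // chunk is playable now: no finalization step required
    }
}
```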
Where the chunks go and what happens next
The full data path from a pixel on your screen to an AI decision is five hops. None of them require the network unless you route the reasoning step that way.
From screen pixel to AI decision, fully on-device
The hub is the encoder. The left side is capture; the right side is reasoning and action. You can swap the right side (point the reasoning at a local model through an Anthropic-compatible proxy via the Custom API Endpoint setting, which sets ANTHROPIC_BASE_URL on the bridge process), and the left side keeps working the same way. That is what makes this a usable local stack and not a demo.
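What that endpoint swap looks like in practice can be sketched as follows. The bridge binary path is hypothetical; only the ANTHROPIC_BASE_URL mapping is taken from this guide:

```swift
import Foundation

// Sketch of the reasoning-endpoint swap: the Custom API Endpoint setting
// becomes ANTHROPIC_BASE_URL in the environment of the bridge subprocess.
// The executable path below is hypothetical.
func makeBridgeProcess(customEndpoint: String?) -> Process {
    let bridge = Process()
    bridge.executableURL = URL(fileURLWithPath: "/usr/local/bin/fazm-bridge") // hypothetical path
    var env = ProcessInfo.processInfo.environment
    if let endpoint = customEndpoint {
        // e.g. "http://localhost:8080" for a local Anthropic-compatible proxy
        env["ANTHROPIC_BASE_URL"] = endpoint
    }
    bridge.environment = env
    return bridge   // not launched here; the capture side works the same either way
}
```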
What you actually see on disk
Open a terminal on a Mac that has Fazm running and you can verify the pipeline is doing what the source says. No special tools, no debugging, just ls and file.
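Assuming the chunk files carry an .mp4 extension (the exact naming is not quoted in the source), the check is two commands:

```shell
# Path comes from the source tree; chunk naming (.mp4) is an assumption.
DIR="$HOME/Library/Caches/observer-recordings"

# List whatever chunks the recorder has flushed so far.
ls -lh "$DIR" 2>/dev/null || echo "no recordings yet"

# Ask the OS what a chunk actually is; a fragmented HEVC chunk
# should identify as ISO Media / MP4 data.
file "$DIR"/*.mp4 2>/dev/null || true
```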
The numbers that matter for a real laptop
Compared with the synthesis side of the keyword, where a usable HunyuanVideo setup wants 24 GB of VRAM and minutes of wall-clock time per short clip, the capture-and-reason side is a different sport. Both are real, both are "local AI video", and they need different machines.
Where each tool actually fits
A head-to-head comparison for the job the SERP does not treat as one. Synthesis vs capture-and-reason, with the honest tradeoffs.
| Feature | ComfyUI + Wan/Hunyuan (synthesize) | Fazm (capture + reason) |
|---|---|---|
| Primary job | Generate new pixels from a prompt | Record real screen, feed an AI |
| Hardware floor | 24 GB NVIDIA VRAM, CUDA | Any Apple Silicon Mac |
| Interface | ComfyUI node graph | Consumer Mac app |
| Operating system | Linux or Windows | macOS |
| Encoding | N/A, output is a raw frame tensor | hevc_videotoolbox (hardware) |
| Offline capable | Yes, weights on disk | Capture yes, reasoning depends on endpoint |
| Setup time | Python env + ComfyUI + model downloads | DMG install, grant screen permission |
| Good answer when your question is | 'Make a 4 second clip of a dragon' | 'Make an AI watch me work locally' |
The tools are not competitors, they are complements. A serious local AI video setup can use both.
Adjacent tools worth naming
Short list of tools you will see in the same search, with a one-line note on what each one actually does so you do not pick the wrong stack for your job.
Picking the right stack in 60 seconds
You have a 3090/4090/5090
Install ComfyUI, pull HunyuanVideo or Wan 2.6 weights, run through the community node graphs. Stay on the synthesis track, it is what that hardware is for.
You have an Apple Silicon Mac
Synthesis is not your happy path in 2026. The capture-and-reason track is. Install Fazm, grant screen recording permission, let the 2 FPS observer run, and point the reasoning at whichever model you want through the Custom API Endpoint field.
You need clips for content
Hosted APIs win on cost-of-first-clip. Treat local synthesis as a hobby or a production line, not a casual tool.
You need the footage to stay on your machine
That is the whole point of the Fazm pipeline. hevc_videotoolbox, local cache, no backend URL. The source code enforces it.
Want to see the Mac-local video pipeline running on a real workflow?
Book 15 minutes with the team. We will open the cache folder, play a chunk in QuickTime, and walk through the reasoning loop end to end.
Book a call →
Questions readers ask after this page
What do people usually mean by 'local AI video generator'?
In 2026 the phrase has split into two jobs that share a name. The first is synthesis: a model like HunyuanVideo, Wan 2.2, Wan 2.6 or LTX Video running on your own GPU, driven through ComfyUI, turning a text or image prompt into pixels. That is what almost every roundup covers. The second is capture and reasoning: an AI that records your real screen to local video files and makes decisions from what it sees, without shipping the footage to a cloud. That is a different category with different hardware requirements, and almost no article treats it as part of the same question.
What do the top synthesis models actually need to run locally?
HunyuanVideo is a 13B parameter text-to-video model and runs comfortably on 24 GB of VRAM. Wan 2.1 and Wan 2.6 are also usually quoted at 24 GB VRAM as the practical floor, with the fp16 versions wanting more. LTX Video is the lightest of the headline models and can fit on smaller cards at the cost of fidelity. In every case the assumed setup is NVIDIA, CUDA, a Python environment, and ComfyUI as the interface. None of them has first-class Apple Silicon support, and most of them will not generate useful clips on a laptop CPU. If you need synthesis and you do not already own a 3090, 4090, or 5090 class card, the honest advice is to use a hosted API and stop fighting the stack.
Where does Fazm fit in, if Fazm is not a text-to-video model?
Fazm is a consumer Mac app that does the second job. It uses Apple's hardware HEVC encoder (hevc_videotoolbox) through a bundled ffmpeg to record the frontmost window at 2 FPS into H.265 chunks on disk, then pipes each 60 second chunk into an on-device analysis loop so an AI can reason about what you are doing and act on real apps. That pipeline runs in what the Fazm source code explicitly labels 'local-only' mode: no backend URL, no backend secret, just a cache directory on your Mac keyed by a hardware UUID. No login, no upload, no account.
Where in the Fazm source does the local video pipeline actually live?
Two files. The orchestration is Desktop/Sources/SessionRecordingManager.swift in the startScreenObserver() function, roughly lines 56 to 117, which constructs a SessionRecorder configuration with framesPerSecond: 2.0, chunkDurationSeconds: 60.0, captureMode: .activeWindow, a storage URL under Library/Caches/observer-recordings/, and a deviceId derived from IOPlatformExpertDevice via getHardwareUUID(). The comment 'No backendURL/backendSecret, local-only mode' is in the source. The actual encoder invocation lives in the bundled macos-session-replay package at Sources/SessionReplay/VideoChunkEncoder.swift around line 232, where the ffmpeg arguments include '-vcodec', 'hevc_videotoolbox', '-tag:v', 'hvc1', '-q:v', '65', '-allow_sw', 'true', '-realtime', 'true', '-prio_speed', 'true', and '-movflags', 'frag_keyframe+empty_moov+default_base_moof'. Those are the flags that make the encoder use the Apple Silicon media engine and keep each chunk playable without finalization.
Why does using hevc_videotoolbox matter vs libx264 or libx265?
On Apple Silicon, hevc_videotoolbox runs on the dedicated media engine that lives next to the CPU and GPU on the M-series die. It does not spin up the fans, does not pin a core at 100 percent, and does not compete with whatever app you are recording. A software encoder like libx264 or libx265 would take the same job to the CPU and pay a significant power cost, especially at sustained multi-hour sessions. For an always-on local observer that ships to Gemini or a local model for analysis, the difference is the line between 'background process you forget' and 'laptop on a desk getting warm'. That is why the flag set is there.
Can I use Fazm to generate synthetic video from a prompt?
No. That is the honest answer. Fazm does not ship HunyuanVideo or Wan, does not embed a diffusion pipeline, and does not run ComfyUI workflows. If your goal is to produce new pixels from a text prompt on your own hardware, you want a GPU-first stack and you should read one of the many ComfyUI-on-RTX guides. What Fazm does is a complementary thing. It generates video from your real screen, locally, so an AI can reason about your real work. Two different jobs, one noisy keyword.
Does any of this run while I am offline?
The recording pipeline does. Screen capture via ScreenCaptureKit, HEVC encoding via the Apple media engine, chunk storage to Library/Caches/observer-recordings, and deviceId lookup via IOPlatformExpertDevice are all fully offline. Whether the reasoning layer needs network depends on which model you have configured. Routing analysis through Gemini or a remote API needs network; routing it through a local model behind a proxy does not. Fazm exposes a single 'Custom API Endpoint' field in Settings (it maps to ANTHROPIC_BASE_URL on the bridge subprocess) that lets you point the reasoning step at whatever you like, including an on-device shim.
How does this compare to using a cloud service like Descript or a plain QuickTime recording?
Descript and its peers upload your footage to a service, then run AI analysis or editing against it in their cloud. That is fine for finished content, awful for private work on internal apps or regulated data. QuickTime records locally but has no AI layer and no chunking; you get a single MOV file with nothing watching it. Fazm sits in the gap: local encoding (hevc_videotoolbox through ffmpeg), chunking every 60 seconds so the analysis loop has something to work with in near-real time, and a reasoning layer that can stay on-device if you route it that way. It is the stack, not the format, that is the distinction.
What happens to the recordings? Do they fill up my disk?
They live in Library/Caches/observer-recordings under your user account, which is a standard macOS cache location the system is allowed to evict under pressure. At 2 FPS active-window capture encoded in H.265, sustained chunk sizes are small, much smaller than typical video because the frame rate is low and the source is mostly static UI. Fazm does not keep chunks forever; it is a rolling buffer for the reasoning layer. The folder is inspectable. You can open it, play a chunk in QuickTime, and delete anything you want.
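The rolling-buffer behavior can be sketched as a simple eviction pass. This is an illustration of the concept only, not Fazm's actual cleanup code, and `keepLatest` is a made-up parameter:

```swift
import Foundation

// A minimal sketch of the rolling-buffer idea: keep only the most recent
// chunks in the cache directory. Illustrative only; NOT Fazm's eviction code.
func pruneChunks(in dir: URL, keepLatest: Int) throws {
    let fm = FileManager.default
    let chunks = try fm.contentsOfDirectory(at: dir, includingPropertiesForKeys: [.creationDateKey])
        .sorted { lhs, rhs in
            let l = (try? lhs.resourceValues(forKeys: [.creationDateKey]).creationDate) ?? .distantPast
            let r = (try? rhs.resourceValues(forKeys: [.creationDateKey]).creationDate) ?? .distantPast
            return l < r   // oldest first
        }
    // Evict everything except the newest `keepLatest` chunks.
    for old in chunks.dropLast(keepLatest) {
        try fm.removeItem(at: old)
    }
}
```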