Local image to video AI: two routes, one flag that decides your battery
Route one is diffusion i2v: one still, a motion prompt, a 24 GB NVIDIA card, ComfyUI. That is what the 2026 SERP covers. Route two is image-sequence to video: a stream of real images going into a hardware HEVC encoder locally, with an AI watching the chunks on the other side. That one is almost never covered, and it is where the Fazm pipeline actually lives. Both are real. They do not substitute for each other, and they do not want the same machine.
THE CATEGORY SPLIT
One keyword, two very different jobs
Read any 2026 article ranking for "local image to video AI" and you will get the diffusion roster: HunyuanVideo-I2V, Wan-i2v, LTX-Video, CogVideoX-I2V, Stable Video Diffusion. All running through ComfyUI, all assuming a 24 GB NVIDIA card. What the roundups never separate out is that the phrase also covers a much simpler job: a stream of real images becoming a playable local video so an AI can watch it. That second job is everywhere in real workflows and nowhere in the SERP.
Route A: diffusion image-to-video
A single still image plus a motion prompt fed into a diffusion or transformer model. Output: net-new frames that did not exist before. Examples: HunyuanVideo-I2V, Wan-i2v (2.1/2.2/2.6), LTX-Video, CogVideoX-I2V, Stable Video Diffusion. Interface: ComfyUI. Hardware floor: 12-24 GB NVIDIA VRAM depending on model. Apple Silicon path: painful, partial, not the happy road.
Route B: image-sequence to video
A stream of images you already have (screenshots, chart frames, renders, time-lapse photos) piped into a local hardware video encoder. Output: a real playable MP4 or MOV that any VLM can read. Hardware: any Apple Silicon Mac. Interface: a consumer app, not a node graph.
They solve different problems
Route A invents motion that was not there. Route B preserves motion that was. Picking the wrong one for your job is why people install ComfyUI for a task that actually wanted ffmpeg and a 30-line Swift file.
What this guide does
Names the Route A shortlist and the honest hardware cost, then opens up Route B at the source level. You get the exact ffmpeg invocation Fazm uses, the CGImage API signature, the line-220 comment that explains the BGRA-over-PNG decision, and the one flag that keeps the encoder on the Apple media engine.
ROUTE A, BRIEFLY
The diffusion i2v shortlist, with the hardware truth attached
If you genuinely need to hallucinate new frames from a single still image on your own hardware, here is the honest 2026 landscape. All assume an NVIDIA card, CUDA, Python, and ComfyUI. None of them has a first-class Apple Silicon inference path.
HunyuanVideo-I2V (Tencent, 13B backbone)
The quality leader for open weights image-to-video. Cinematic motion, strong temporal coherence, still the one most demo reels are cut from. Practical floor: 24 GB VRAM, 3090 class or newer. ComfyUI node graph for the workflow.
Wan-i2v 2.1 / 2.2 / 2.6 (Alibaba)
The other 'good at i2v' family. Open weights, competitive with closed tools on short clips. fp8 path fits 24 GB; fp16 wants more. RTX 5090 class is the comfortable target. Each minor version ships a cleaner i2v preset.
LTX-Video-i2v (Lightricks)
Fastest of the headliners, smaller memory footprint, fits on 12-16 GB at reduced quality. Good for iteration, rough-cut previews, concept testing. Fidelity trails HunyuanVideo-I2V for final output.
Stable Video Diffusion (Stability AI)
The original open i2v model. Runs on 8-12 GB but caps at roughly 2-4 second clips and lower resolution. Useful if your card is modest and the clip is short. Superseded by the above for longer, higher-quality output.
CogVideoX-I2V (Zhipu/Tsinghua)
The 'still shipped, still competitive on some clips' option. 16-24 GB depending on variant. Often the right pick when HunyuanVideo or Wan produce motion you do not want.
The honest summary: if you do not already own a 3090, 4090 or 5090 class NVIDIA card, Route A is not a laptop story in 2026. It is a workstation-or-cloud story. Running a 13B i2v model on a MacBook is technically possible through Metal Performance Shaders and Apple's unified memory trick, but the time-per-clip turns a creative tool into a daily chore.
“There is a second meaning of 'local image to video AI' the SERP almost never covers: take images you already have, encode them locally, and let the AI reason about the resulting video.”
-vcodec hevc_videotoolbox, the line that decides your battery
ROUTE B, IN DEPTH
The image-sequence pipeline, in Swift, with the one-line comment that explains everything
The public API is a single function. It takes a CGImage and a timestamp, and returns when the frame has been handed to the encoder. This is the image-to-video function, in the plainest possible shape.
Behind that signature are two engineering decisions that are worth reading one step at a time. Both are stated explicitly in the source, not implied.
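The real function is Swift and takes a CGImage; as a language-neutral sketch of the same shape (the names here are illustrative, not Fazm's API), the loop amounts to: open the encoder process once, then hand each raw frame buffer to its stdin and return.

```python
import subprocess

# Sketch, not Fazm source: the shape of the frame-feeding loop. The encoder
# process is opened once; each call hands one raw frame buffer to its stdin.
def feed_frames(argv: list[str], frames: list[bytes]) -> int:
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                            stdout=subprocess.DEVNULL)
    for buf in frames:
        proc.stdin.write(buf)   # returns once the frame is handed to the encoder
    proc.stdin.close()          # EOF lets the encoder flush the final fragment
    return proc.wait()
```

In the real pipeline, argv is the ffmpeg invocation this guide anchors on and each buffer is one drawn CGImage; any stdin-reading process behaves the same way.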
Decision 1: raw BGRA stdin, not per-frame PNG
The comment on line 220 is verbatim: "Use raw BGRA pixel input instead of PNG to avoid expensive per-frame PNG encoding". PNG is a DEFLATE-compressed image format; encoding a PNG for every captured frame would add a CPU-bound compression step before the frame even reaches the video encoder, and the encoder would just decompress it right back. Drawing the CGImage directly into a BGRA CGContext and streaming those bytes to ffmpeg's stdin skips both steps. This is the single line that makes the pipeline cheap enough to run sustained.
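The size of the object being streamed is trivial arithmetic (the dimensions below are illustrative, not Fazm's capture size):

```python
# Sketch: a raw BGRA frame is a fixed-size buffer, width * height * 4 bytes,
# already in the layout ffmpeg's rawvideo demuxer expects. No per-frame
# compression on the way in, no decompression on the way out.
def bgra_frame_size(width: int, height: int) -> int:
    return width * height * 4  # one byte each for blue, green, red, alpha

frame = bgra_frame_size(1440, 900)   # 5,184,000 bytes per frame
per_second = frame * 2               # ~10.4 MB/s into the encoder at 2 FPS
```

That throughput is comfortably inside what a pipe to a hardware encoder sustains, which is exactly why the PNG round trip would be pure overhead.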
Decision 2: interpolationQuality = .low
Line 282: context.interpolationQuality = .low. The CGContext draw uses low-quality interpolation because the downstream consumer is a vision-language model reading UI text and layout, not a human watching a 4K playback. Low interpolation plus the BGRA byte-order trick are what allow the per-frame draw to stay off the hot path. A bicubic interpolation setting at 2 FPS for hours would add a measurable CPU cost; the source opts out on purpose.
THE ANCHOR INVOCATION
The exact ffmpeg command that turns images into a local video
This block lives in Desktop/.build/checkouts/macos-session-replay/Sources/SessionReplay/VideoChunkEncoder.swift, lines 226 through 242. It is what the public addFrame(image:) call actually feeds. No frame survives the pipeline without going through it.
The top half reads the images. The middle configures the Apple Silicon hardware encoder. The bottom makes the output chunk playable without a finalization pass. Here is what each cluster is doing.
What each flag cluster does
- -f rawvideo -pixel_format bgra -video_size WxH -r 2 -i -: tells ffmpeg 'the images are raw BGRA pixels, at these dimensions, at 2 FPS, on stdin'. This is the image-to-video input side.
- -vcodec hevc_videotoolbox: routes encoding to Apple's dedicated media engine. The one flag that decides whether the pipeline costs watts or pegs a CPU core.
- -tag:v hvc1: marks the stream as QuickTime-friendly HEVC. Without this, Finder preview and QuickTime will not play the resulting chunk.
- -q:v 65 -allow_sw true: quality target inside VideoToolbox's constant-quality range (on its 1-100 scale, higher values mean higher quality), with a software fallback if the media engine is unavailable.
- -realtime true -prio_speed true: latency-first, not ratio-first. The chunk needs to be ready when the AI wants to read it.
- -movflags frag_keyframe+empty_moov+default_base_moof: writes a fragmented MP4 with no final moov atom. Each chunk is playable the instant it is flushed.
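Assembled in one place, the clusters above look like this. A hedged sketch in argv form; width, height, and the output path are placeholders, not Fazm's actual values:

```python
# Sketch: the documented flag clusters assembled into one ffmpeg argv.
def chunk_encoder_argv(width: int, height: int, out_path: str) -> list[str]:
    return [
        "ffmpeg",
        # input side: raw BGRA pixels on stdin at 2 FPS
        "-f", "rawvideo", "-pixel_format", "bgra",
        "-video_size", f"{width}x{height}", "-r", "2", "-i", "-",
        # encode side: Apple media engine, QuickTime-friendly tagging
        "-vcodec", "hevc_videotoolbox", "-tag:v", "hvc1",
        "-q:v", "65", "-allow_sw", "true",
        "-realtime", "true", "-prio_speed", "true",
        # container side: fragmented MP4, playable before finalization
        "-movflags", "frag_keyframe+empty_moov+default_base_moof",
        out_path,
    ]

argv = chunk_encoder_argv(1440, 900, "chunk-0001.mp4")
```

Swapping "hevc_videotoolbox" for "libx265" in that list is the one-token change that moves the whole job from the media engine onto CPU cores.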
From CGImage to a reasoning step, in one diagram
Five hops separate an image in memory from an AI decision. None of them requires the network, unless you route the reasoning step that way on purpose.
Image-to-video, fully local: CGImage → BGRA CGContext draw → ffmpeg stdin → hevc_videotoolbox → fragmented MP4 chunk → vision-language model.
The hub is the encoder. The left side is image ingestion; the right side is reasoning and action. Swap the right side for whatever you like through the Custom API Endpoint field (which sets ANTHROPIC_BASE_URL on the bridge subprocess), and the image-to-video half keeps working the same way.
The honest comparison, side by side
If you toggle between the two routes below, what changes is not quality or creativity. It is which problem the stack is built to solve and what machine you need to solve it on.
Same keyword, two different stacks
One image plus a motion prompt. A 13B image-to-video model hallucinates roughly 2-5 seconds of new frames. ComfyUI is the interface. The only realistic path in 2026 is a 24 GB NVIDIA card or a hosted API.
- Needs 24 GB VRAM (or close) for the quality leaders
- No first-class Apple Silicon path in 2026
- Per-clip latency is minutes, not seconds
- Hallucinates motion that was not there
What it looks like on disk
Route B is the rare AI pipeline you can audit with ls and file. Open a terminal on a Mac that has Fazm running and the output is unambiguous.
The numbers that matter if this is running on your actual laptop
For contrast on Route A, HunyuanVideo-I2V wants roughly 24 GB of VRAM and will happily consume several minutes per clip on a 3090. Route B sustains 2 FPS on a MacBook for an entire work session without the fans spinning up. These are not competing benchmarks. They are measurements for two different sports wearing the same jersey.
Where each tool actually fits
Side-by-side on the primary axes the SERP skips. Synthesis vs image-sequence-to-local-video, honestly scored.
| Feature | ComfyUI + HunyuanVideo-I2V / Wan-i2v | Fazm (image stream → local video + AI) |
|---|---|---|
| Primary job | Hallucinate new frames from one still | Encode a stream of real images, feed an AI |
| Hardware floor | 24 GB NVIDIA VRAM, CUDA | Any Apple Silicon Mac |
| Encoder | N/A, output is a raw frame tensor | hevc_videotoolbox on the Apple media engine |
| Input | One still + motion prompt | CGImage at 2 FPS (screen, images, renders) |
| Interface | ComfyUI with a graph of nodes | Consumer Mac app, no node graph |
| Latency per unit of output | Minutes per 2-5 s clip | ~60 s per chunk (fragmented MP4) |
| Typical power draw | High, full GPU utilisation | Low, encoder on dedicated media engine |
| Good answer when your question is | 'Make a 4 second clip of a dragon flying' | 'Make an AI watch my real screen or image stream' |
Not competitors. The full local AI video stack can want both, for two different jobs.
SEQUENCED WALKTHROUGH
What happens during one 60 second chunk
One chunk, start to finish
Frame arrives: a CGImage plus timestamp enters through addFrame. It is drawn into a BGRA CGContext with low interpolation, the raw bytes are written to ffmpeg's stdin, and hevc_videotoolbox encodes them on the media engine. At the 60 second mark the fragmented MP4 chunk is flushed, playable immediately, and ready for the reasoning layer to read.
Adjacent tools you will see in the same SERP
A short map of every tool that competes for this keyword, split by which route it actually serves so you do not install the wrong one.
- Route A (diffusion i2v): HunyuanVideo-I2V, Wan-i2v, LTX-Video, CogVideoX-I2V, Stable Video Diffusion, all fronted by ComfyUI on an NVIDIA card.
- Route B (image-sequence to video): ffmpeg with hevc_videotoolbox as the engine, Fazm as the productised pipeline on top.
Picking the right route in under a minute
You own a 3090 / 4090 / 5090
Route A. Install ComfyUI, pull the HunyuanVideo-I2V or Wan-i2v weights, follow the community graphs. That hardware is what diffusion video was built for.
You are on an Apple Silicon Mac
Route B. Install Fazm, grant screen recording permission, let the 2 FPS observer run, and point the reasoning step wherever you like through the Custom API Endpoint field.
You have a folder full of images already
Route B is almost certainly what you want. The underlying ffmpeg pattern (-f rawvideo -pixel_format bgra -i -) works for any CGImage source. Fazm productises screen; the pattern generalises.
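If the frames already live on disk as encoded files, the stdin half is not even needed: ffmpeg's image-sequence input can read the folder directly while the encoder flags stay identical. A sketch with illustrative paths and rate:

```python
# Sketch: the folder-of-images variant. When frames are already files on disk,
# ffmpeg's glob pattern input replaces the raw-BGRA stdin side; the
# hevc_videotoolbox encode side is unchanged.
def timelapse_argv(pattern: str, fps: int, out_path: str) -> list[str]:
    return [
        "ffmpeg",
        "-framerate", str(fps),              # input rate for the still sequence
        "-pattern_type", "glob", "-i", pattern,
        "-vcodec", "hevc_videotoolbox", "-tag:v", "hvc1",
        out_path,
    ]

argv = timelapse_argv("renders/*.jpg", 2, "timelapse.mp4")
```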
You need short creative clips of new motion
Route A. Hosted APIs win on cost-of-first-clip. Local diffusion is a production line if you own the card, a hobby if you do not.
Want to see the image-stream-to-local-video pipeline on your own Mac?
Book 15 minutes. We will open observer-recordings, play a chunk in QuickTime, and trace one CGImage from the SessionRecorder into the HEVC chunk.
Book a call →
Questions readers ask after this page
What does 'local image to video AI' actually cover in 2026?
Two different stacks that share a phrase. Stack one is diffusion image-to-video: a single still image plus a motion prompt fed into HunyuanVideo-I2V, Wan-i2v, LTX-Video, CogVideoX-I2V or Stable Video Diffusion, hallucinating new frames on a 24 GB NVIDIA card. Stack two is image-sequence to video: a stream of real images (screenshots, chart frames, design iterations, time-lapse photos) piped into a local hardware encoder so an on-device AI can reason about the resulting video. Almost every article treats the first stack as the only answer. It is not.
What GPU do the diffusion image-to-video models actually need to run locally?
HunyuanVideo-I2V is built on the 13B HunyuanVideo backbone and runs comfortably on 24 GB VRAM, 3090 class or newer. Wan-i2v (2.1, 2.2, 2.6) is also quoted at 24 GB for the fp8 path, and significantly more for fp16. LTX-Video is the lightest of the headline models and will fit on 12-16 GB at reduced quality. Stable Video Diffusion (SVD) will run on 8-12 GB but caps at roughly 2-4 second clips. CogVideoX-I2V needs 16-24 GB depending on the variant. In every case the assumption is an NVIDIA card, CUDA, Python, and ComfyUI as the front end. None of these models has a first-class Apple Silicon path in April 2026.
What is the 'image-sequence to video' route and why does nobody cover it?
Because it is not glamorous. You start with a sequence of images you already have (screenshots, dashboard frames every N seconds, rendered design iterations, time-lapse photos, receipt scans, medical slice exports), pipe them into a local hardware video encoder, and the output is a real playable video file that any vision-language model can reason about. No diffusion, no hallucination, no GPU pressure. The reason nobody covers it is that it does not need a 5090 card, so it does not feature in 'best local video AI' roundups. Fazm's screen observer is exactly this pipeline, productised on a Mac.
What is the anchor fact that makes this page uncopyable?
A one-line comment in the Fazm source tree. The file Desktop/.build/checkouts/macos-session-replay/Sources/SessionReplay/VideoChunkEncoder.swift at line 220 says: 'Use raw BGRA pixel input instead of PNG to avoid expensive per-frame PNG encoding'. That comment lives directly above the ffmpeg invocation on lines 226-242 which uses the arguments '-f rawvideo -pixel_format bgra -r 2 -i - -vcodec hevc_videotoolbox -tag:v hvc1 -q:v 65 -allow_sw true -realtime true -prio_speed true'. The public API that feeds it is at line 86: 'public func addFrame(image: CGImage, timestamp: Date) async throws -> EncodedFrame?'. That is the image-to-video function, in plain Swift, running locally on your Mac.
Why is raw BGRA faster than writing each frame as PNG?
PNG is a compressed, DEFLATE-based image format. Encoding a PNG on every captured frame adds a CPU-bound compression step before the frame even reaches the video encoder, and then the video encoder immediately has to decompress it back to pixels. Raw BGRA skips both steps. The Fazm pipeline draws the CGImage into a BGRA CGContext with interpolationQuality set to low (line 282), then writes the raw bytes to ffmpeg stdin. That data is already in the exact memory layout hevc_videotoolbox wants. The comment on line 220 is not a stylistic preference, it is the practical reason the pipeline can run at 2 FPS sustained for hours without pinning a core.
Which ffmpeg flag actually determines whether your laptop gets hot?
-vcodec hevc_videotoolbox. On Apple Silicon, that one flag routes encoding to the dedicated media engine on the M-series die, a block that exists next to the CPU and GPU with its own power budget. The alternative flags, -vcodec libx264 or -vcodec libx265, would take the job to the CPU and pay a sustained multi-watt cost for multi-hour sessions. For a continuously running image-sequence-to-video pipeline, that single substitution is the line between 'background observer you forget about' and 'laptop that runs hot after an hour'. libx265 software encoding of a 24 FPS 1080p stream can easily saturate two CPU cores; hevc_videotoolbox at 2 FPS barely registers in Activity Monitor.
Can Fazm generate synthetic video from a single image, like SVD or HunyuanVideo-I2V?
No. Fazm does not ship a diffusion model, does not embed a UNet, and does not run ComfyUI workflows. If your goal is to produce new pixels from a single still, you want the diffusion stack with an NVIDIA GPU. What Fazm does is the other half of the keyword: take a real stream of images you already have (most often a captured screen) and produce a local H.265 video file that the reasoning layer can watch. Two jobs, one keyword, and they do not substitute for each other.
Can I use the same pipeline for my own image directory, not just screenshots?
The Fazm consumer app wires the pipeline to screen capture because that is the product. But the underlying flow is general: CGImage in, HEVC chunk out, via ffmpeg stdin and hevc_videotoolbox. If you are a developer building a local image-to-video workflow on a Mac (time-lapse from a folder of JPEGs, frame-by-frame render review, animated chart exports), the same '-f rawvideo -pixel_format bgra -i -' pattern works. The productised version, where an AI reasoning layer sits on top of the chunks, is what Fazm packages for non-developers.
Why 2 FPS, not 24 or 30?
Because the downstream consumer is a vision-language model reasoning about UI state, not a human watching a cinematic playback. At 2 FPS the model still sees every meaningful transition (a click, a scroll, a new modal opening), the encoded chunks are roughly an order of magnitude smaller than a 24 FPS equivalent, and the capture loop never has to compete for CPU with whatever app you are recording. The SessionRecorder configuration literally sets framesPerSecond: 2.0 with a comment 'Lower FPS than research recorder to save CPU'. Higher frame rates are a tax you should not be paying if your use case is 'AI watches the screen', and most roundup-level tooling forces a higher default because its defaults were written for humans.
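The 'order of magnitude' claim is plain arithmetic over one 60 second chunk:

```python
# Frame-count arithmetic for one 60 second chunk at each rate.
capture_fps = 2
cinematic_fps = 24
chunk_seconds = 60

frames_capture = capture_fps * chunk_seconds    # 120 frames per chunk
frames_cinema = cinematic_fps * chunk_seconds   # 1440 frames per chunk
ratio = frames_cinema // frames_capture         # 12x fewer frames to encode
```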
Does the reasoning layer stay local too, or does it call a cloud model?
The capture pipeline is fully local: ScreenCaptureKit, a BGRA CGContext, ffmpeg with hevc_videotoolbox, and chunks written under Library/Caches/observer-recordings. The reasoning step is configurable. Out of the box, Fazm can route analysis through Gemini, which is a network call. If you want to stay fully offline, Fazm exposes a Custom API Endpoint field in Settings that sets ANTHROPIC_BASE_URL on the bridge subprocess, so you can point reasoning at an Anthropic-compatible proxy in front of a local model (Ollama, LM Studio, llama.cpp server). The image-to-video half does not care where reasoning runs; it just produces chunks.
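Mechanically, 'sets ANTHROPIC_BASE_URL on the bridge subprocess' amounts to injecting one environment variable into the child process. A sketch; the localhost URL is an assumption standing in for whatever proxy you actually run:

```python
import os

# Sketch: build the child-process environment with the reasoning endpoint
# overridden. Any Anthropic-compatible proxy URL works here.
def bridge_env(custom_endpoint: str) -> dict[str, str]:
    env = dict(os.environ)
    env["ANTHROPIC_BASE_URL"] = custom_endpoint  # reasoning calls go here
    return env

env = bridge_env("http://127.0.0.1:8080")  # e.g. a proxy in front of Ollama
# subprocess.Popen(["bridge"], env=env) would launch a hypothetical bridge
# binary with that environment; the capture pipeline never reads the variable.
```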