APRIL 2026 / LOCAL I2V, WITH THE SWIFT RECEIPTS

Local image to video AI: two routes, one flag that decides your battery

Route one is diffusion i2v: one still, a motion prompt, a 24 GB NVIDIA card, ComfyUI. That is what the 2026 SERP covers. Route two is image-sequence to video: a stream of real images going into a hardware HEVC encoder locally, with an AI watching the chunks on the other side. That one is almost never covered, and it is where the Fazm pipeline actually lives. Both are real. They do not substitute for each other, and they do not want the same machine.

Matthew Diakonov
10 min read
Written from the Fazm source tree
CGImage → BGRA → hevc_videotoolbox
Line 220: raw BGRA beats per-frame PNG
2 FPS, 60 second chunks, Apple media engine
Diffusion i2v models compared honestly
File paths and line numbers cited

THE CATEGORY SPLIT

One keyword, two very different jobs

Read any 2026 article ranking for "local image to video AI" and you will get the diffusion roster: HunyuanVideo-I2V, Wan-i2v, LTX-Video, CogVideoX-I2V, Stable Video Diffusion. All running through ComfyUI, all assuming a 24 GB NVIDIA card. What the roundups never separate out is that the phrase also covers a much simpler job: a stream of real images becoming a playable local video so an AI can watch it. That second job is everywhere in real workflows and nowhere in the SERP.

Route A: diffusion image-to-video

A single still image plus a motion prompt fed into a diffusion or transformer model. Output: net-new frames that did not exist before. Examples: HunyuanVideo-I2V, Wan-i2v (2.1/2.2/2.6), LTX-Video, CogVideoX-I2V, Stable Video Diffusion. Interface: ComfyUI. Hardware floor: 12-24 GB NVIDIA VRAM depending on model. Apple Silicon path: painful, partial, not the happy road.

Route B: image-sequence to video

A stream of images you already have (screenshots, chart frames, renders, time-lapse photos) piped into a local hardware video encoder. Output: a real playable MP4 or MOV that any VLM can read. Hardware: any Apple Silicon Mac. Interface: a consumer app, not a node graph.

They solve different problems

Route A invents motion that was not there. Route B preserves motion that was. Picking the wrong one for your job is why people install ComfyUI for a task that actually wanted ffmpeg and a 30-line Swift file.

What this guide does

Names the Route A shortlist and the honest hardware cost, then opens up Route B at the source level. You get the exact ffmpeg invocation Fazm uses, the CGImage API signature, the line-220 comment that explains the BGRA-over-PNG decision, and the one flag that keeps the encoder on the Apple media engine.

ROUTE A, BRIEFLY

The diffusion i2v shortlist, with the hardware truth attached

If you genuinely need to hallucinate new frames from a single still image on your own hardware, here is the honest 2026 landscape. All assume an NVIDIA card, CUDA, Python, and ComfyUI. None of them has a first-class Apple Silicon inference path.

1. HunyuanVideo-I2V (Tencent, 13B backbone)

The quality leader for open weights image-to-video. Cinematic motion, strong temporal coherence, still the one most demo reels are cut from. Practical floor: 24 GB VRAM, 3090 class or newer. ComfyUI node graph for the workflow.

2. Wan-i2v 2.1 / 2.2 / 2.6 (Alibaba)

The other 'good at i2v' family. Open weights, competitive with closed tools on short clips. fp8 path fits 24 GB; fp16 wants more. RTX 5090 class is the comfortable target. Each minor version ships a cleaner i2v preset.

3. LTX-Video-i2v (Lightricks)

Fastest of the headliners, smaller memory footprint, fits on 12-16 GB at reduced quality. Good for iteration, rough-cut previews, concept testing. Fidelity trails HunyuanVideo-I2V for final output.

4. Stable Video Diffusion (Stability AI)

The original open i2v model. Runs on 8-12 GB but caps at roughly 2-4 second clips and lower resolution. Useful if your card is modest and the clip is short. Superseded by the above for longer, higher-quality output.

5. CogVideoX-I2V (Zhipu/Tsinghua)

The 'still shipped, still competitive on some clips' option. 16-24 GB depending on variant. Often the right pick when HunyuanVideo or Wan produce motion you do not want.

The honest summary: if you do not already own a 3090, 4090 or 5090 class NVIDIA card, Route A is not a laptop story in 2026. It is a workstation-or-cloud story. Running a 13B i2v model on a MacBook is technically possible through Metal Performance Shaders and Apple's unified memory trick, but the time-per-clip turns a creative tool into a daily chore.


There is a second meaning of 'local image to video AI' the SERP almost never covers: take images you already have, encode them locally, and let the AI reason about the resulting video.

-vcodec hevc_videotoolbox, the line that decides your battery

ROUTE B, IN DEPTH

The image-sequence pipeline, in Swift, with the one-line comment that explains everything

The public API is a single function. It takes a CGImage and a timestamp, and returns when the frame has been handed to the encoder. This is the image-to-video function, in the plainest possible shape.

VideoChunkEncoder.swift (line 86)

public func addFrame(image: CGImage, timestamp: Date) async throws -> EncodedFrame?

Behind that signature are two engineering decisions that are worth reading one step at a time. Both are stated explicitly in the source, not implied.

Decision 1: raw BGRA stdin, not per-frame PNG

The comment on line 220 is verbatim: "Use raw BGRA pixel input instead of PNG to avoid expensive per-frame PNG encoding". PNG is a DEFLATE-compressed image format; encoding a PNG for every captured frame would add a CPU-bound compression step before the frame even reaches the video encoder, and the encoder would just decompress it right back. Drawing the CGImage directly into a BGRA CGContext and streaming those bytes to ffmpeg's stdin skips both steps. This is the single line that makes the pipeline cheap enough to run sustained.

Decision 2: interpolationQuality = .low

Line 282: context.interpolationQuality = .low. The CGContext draw uses low-quality interpolation because the downstream consumer is a vision-language model reading UI text and layout, not a human watching a 4K playback. Low interpolation plus the BGRA byte-order trick are what allow the per-frame draw to stay off the hot path. A bicubic interpolation setting at 2 FPS for hours would add a measurable CPU cost; the source opts out on purpose.
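Both decisions can be sketched in one function. This is an illustrative reconstruction, not the Fazm source: the function name and the FileHandle plumbing are assumptions, but the BGRA bitmap layout, the .low interpolation, and the raw-bytes write are exactly the techniques described above.

```swift
import CoreGraphics
import Foundation

// Sketch of decisions 1 and 2 together (hypothetical names, not Fazm's code).
// A CGImage is drawn into a raw BGRA buffer, then the bytes go straight to
// ffmpeg's stdin with no PNG round-trip.
func writeBGRAFrame(_ image: CGImage, to ffmpegStdin: FileHandle) {
    let width = image.width
    let height = image.height
    let bytesPerRow = width * 4                        // 4 bytes per BGRA pixel
    var buffer = Data(count: bytesPerRow * height)
    buffer.withUnsafeMutableBytes { raw in
        guard let ctx = CGContext(
            data: raw.baseAddress,
            width: width,
            height: height,
            bitsPerComponent: 8,
            bytesPerRow: bytesPerRow,
            space: CGColorSpaceCreateDeviceRGB(),
            // 32-bit little-endian + premultiplied-first alpha = BGRA in memory.
            bitmapInfo: CGImageAlphaInfo.premultipliedFirst.rawValue
                | CGBitmapInfo.byteOrder32Little.rawValue
        ) else { return }
        ctx.interpolationQuality = .low                // decision 2: cheap draw
        ctx.draw(image, in: CGRect(x: 0, y: 0, width: width, height: height))
    }
    // Decision 1: no per-frame PNG encode; raw bytes stream to the encoder.
    ffmpegStdin.write(buffer)
}
```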

THE ANCHOR INVOCATION

The exact ffmpeg command that turns images into a local video

This block lives in Desktop/.build/checkouts/macos-session-replay/Sources/SessionReplay/VideoChunkEncoder.swift, lines 226 through 242. It is what the public addFrame(image:timestamp:) call actually feeds. No frame survives the pipeline without going through it.

VideoChunkEncoder.swift (lines 226-242)

The top half reads the images. The middle configures the Apple Silicon hardware encoder. The bottom makes the output chunk playable without a finalization pass. Here is what each cluster is doing.

What each flag cluster does

  • -f rawvideo -pixel_format bgra -video_size WxH -r 2 -i -: tells ffmpeg 'the images are raw BGRA pixels, at these dimensions, at 2 FPS, on stdin'. This is the image-to-video input side.
  • -vcodec hevc_videotoolbox: routes encoding to Apple's dedicated media engine. The one flag that decides whether the pipeline costs watts or pegs a CPU core.
  • -tag:v hvc1: marks the stream as QuickTime-friendly HEVC. Without this, Finder preview and QuickTime will not play the resulting chunk.
  • -q:v 65 -allow_sw true: quality target inside VideoToolbox's constant-quality range (on Apple's 1-100 scale, higher means higher quality), with a software fallback if the media engine is unavailable.
  • -realtime true -prio_speed true: latency-first, not ratio-first. The chunk needs to be ready when the AI wants to read it.
  • -movflags frag_keyframe+empty_moov+default_base_moof: writes a fragmented MP4 with no final moov atom. Each chunk is playable the instant it is flushed.
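Assembled from the clusters above, the whole invocation looks like the following sketch. The 1512x982 dimensions and the chunk.mp4 output name are illustrative placeholders, not Fazm's literals, and the script prints the command rather than running it so it is safe to inspect anywhere:

```shell
# Route B encoder invocation, assembled from the flag clusters above.
# W, H, and chunk.mp4 are placeholders; swap `echo` for an actual run
# once a raw BGRA stream is on stdin.
W=1512; H=982
CMD="ffmpeg -f rawvideo -pixel_format bgra -video_size ${W}x${H} -r 2 -i - \
-vcodec hevc_videotoolbox -tag:v hvc1 -q:v 65 -allow_sw true \
-realtime true -prio_speed true \
-movflags frag_keyframe+empty_moov+default_base_moof chunk.mp4"
echo "$CMD"
```

Note that on a non-Apple ffmpeg build hevc_videotoolbox does not exist at all, so -allow_sw true will not save you; the flag only selects the software path inside VideoToolbox itself.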

From CGImage to a reasoning step, in one diagram

Five hops separate an image in memory from an AI decision. None of them require the network, unless you route the reasoning step that way on purpose.

Image-to-video, fully local

ScreenCaptureKit
CGContext BGRA draw
2 FPS sampler
ffmpeg + hevc_videotoolbox
observer-recordings
Vision-language model
Agent action

The hub is the encoder. The left side is image ingestion; the right side is reasoning and action. Swap the right side for whatever you like through the Custom API Endpoint field (which sets ANTHROPIC_BASE_URL on the bridge subprocess), and the image-to-video half keeps working the same way.

The honest comparison, side by side

Set the two routes side by side and what changes is not quality or creativity. It is which problem the stack is built to solve and what machine you need to solve it on.

Same keyword, two different stacks

One image plus a motion prompt. A 13B image-to-video model hallucinates roughly 2-5 seconds of new frames. ComfyUI is the interface. The only realistic path in 2026 is a 24 GB NVIDIA card or a hosted API.

  • Needs 24 GB VRAM (or close) for the quality leaders
  • No first-class Apple Silicon path in 2026
  • Per-clip latency is minutes, not seconds
  • Hallucinates motion that was not there

What it looks like on disk

Route B is the rare AI pipeline you can audit with ls and file. Open a terminal on a Mac that has Fazm running and the output is unambiguous.

inspecting a local image-to-video session

The numbers that matter if this is running on your actual laptop

2 FPS: sustained image intake
60 s: chunk duration
q=65: VideoToolbox quality target
0 uploads: in local-only mode

For contrast on Route A, HunyuanVideo-I2V wants roughly 24 GB of VRAM and will happily consume several minutes per clip on a 3090. Route B sustains 2 FPS on a MacBook for an entire work session without the fans spinning up. These are not competing benchmarks. They are measurements for two different sports wearing the same jersey.

Where each tool actually fits

Side-by-side on the primary axes the SERP skips. Synthesis vs image-sequence-to-local-video, honestly scored.

Feature | ComfyUI + HunyuanVideo-I2V / Wan-i2v | Fazm (image stream → local video + AI)
Primary job | Hallucinate new frames from one still | Encode a stream of real images, feed an AI
Hardware floor | 24 GB NVIDIA VRAM, CUDA | Any Apple Silicon Mac
Encoder | N/A, output is a raw frame tensor | hevc_videotoolbox on the Apple media engine
Input | One still + motion prompt | CGImage at 2 FPS (screen, images, renders)
Interface | ComfyUI with a graph of nodes | Consumer Mac app, no node graph
Latency per unit of output | Minutes per 2-5 s clip | ~60 s per chunk (fragmented MP4)
Typical power draw | High, full GPU utilisation | Low, encoder on dedicated media engine
Good answer when your question is | 'Make a 4 second clip of a dragon flying' | 'Make an AI watch my real screen or image stream'

Not competitors. The full local AI video stack can want both, for two different jobs.

SEQUENCED WALKTHROUGH

What happens during one 60 second chunk

One chunk, start to finish

01 / 06: Frame arrives. ScreenCaptureKit hands over a CGImage of the frontmost window; the timestamp is recorded.

02 / 06: BGRA draw. The CGImage is drawn into a raw BGRA CGContext with interpolationQuality set to .low.

03 / 06: Stdin hand-off. The raw pixel bytes stream to ffmpeg's stdin under -f rawvideo -pixel_format bgra at 2 FPS.

04 / 06: Hardware encode. hevc_videotoolbox compresses the frames on the Apple media engine, not the CPU.

05 / 06: Chunk flush. At the 60 second mark the fragmented MP4 chunk is flushed; frag_keyframe+empty_moov makes it playable immediately.

06 / 06: Reasoning. A vision-language model reads the chunk and the agent acts on what it saw.

Adjacent tools you will see in the same SERP

A short one-line map of every tool that competes for this keyword, split by which route it actually serves so you do not install the wrong one.

HunyuanVideo-I2V: 13B i2v on 24 GB NVIDIA
Wan-i2v 2.2 / 2.6: Alibaba open i2v family
LTX-Video-i2v: speed-first image to video
Stable Video Diffusion: short clips on modest GPUs
CogVideoX-I2V: Zhipu open i2v model
ComfyUI: the node graph for diffusion i2v
Pinokio: one-click installer for diffusion stacks
AnimateDiff: motion adapter for SD image models
ffmpeg + hevc_videotoolbox: Route B encoder
Fazm: image stream → local video + on-device AI

Picking the right route in under a minute

You own a 3090 / 4090 / 5090

Route A. Install ComfyUI, pull the HunyuanVideo-I2V or Wan-i2v weights, follow the community graphs. That hardware is what diffusion video was built for.

You are on an Apple Silicon Mac

Route B. Install Fazm, grant screen recording permission, let the 2 FPS observer run, and point the reasoning step wherever you like through the Custom API Endpoint field.

You have a folder full of images already

Route B is almost certainly what you want. The underlying ffmpeg pattern (-f rawvideo -pixel_format bgra -i -) works for any CGImage source. Fazm productises screen; the pattern generalises.
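As a concrete sketch of that generalisation: a folder of JPEGs can become a 2 FPS HEVC time-lapse through ffmpeg's image demuxer instead of the rawvideo stdin path. The glob pattern and output name are assumptions, and actually running it requires a macOS ffmpeg build with VideoToolbox:

```shell
# Folder-of-images variant of the Route B pattern: ffmpeg's image2 demuxer
# reads the JPEGs directly, so no BGRA stdin plumbing is needed.
# Printed rather than executed so it is safe to inspect first.
CMD="ffmpeg -framerate 2 -pattern_type glob -i '*.jpg' \
-vcodec hevc_videotoolbox -tag:v hvc1 -q:v 65 timelapse.mp4"
echo "$CMD"
```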

You need short creative clips of new motion

Route A. Hosted APIs win on cost-of-first-clip. Local diffusion is a production line if you own the card, a hobby if you do not.

Want to see the image-stream-to-local-video pipeline on your own Mac?

Book 15 minutes. We will open observer-recordings, play a chunk in QuickTime, and trace one CGImage from the SessionRecorder into the HEVC chunk.

Book a call

Questions readers ask after this page

What does 'local image to video AI' actually cover in 2026?

Two different stacks that share a phrase. Stack one is diffusion image-to-video: a single still image plus a motion prompt fed into HunyuanVideo-I2V, Wan-i2v, LTX-Video, CogVideoX-I2V or Stable Video Diffusion, hallucinating new frames on a 24 GB NVIDIA card. Stack two is image-sequence to video: a stream of real images (screenshots, chart frames, design iterations, time-lapse photos) piped into a local hardware encoder so an on-device AI can reason about the resulting video. Almost every article treats the first stack as the only answer. It is not.

What GPU do the diffusion image-to-video models actually need to run locally?

HunyuanVideo-I2V is built on the 13B HunyuanVideo backbone and runs comfortably on 24 GB VRAM, 3090 class or newer. Wan-i2v (2.1, 2.2, 2.6) is also quoted at 24 GB for the fp8 path, and significantly more for fp16. LTX-Video is the lightest of the headline models and will fit on 12-16 GB at reduced quality. Stable Video Diffusion (SVD) will run on 8-12 GB but caps at roughly 2-4 second clips. CogVideoX-I2V needs 16-24 GB depending on the variant. In every case the assumption is an NVIDIA card, CUDA, Python, and ComfyUI as the front end. None of these models have a first-class Apple Silicon path in April 2026.

What is the 'image-sequence to video' route and why does nobody cover it?

Because it is not glamorous. You start with a sequence of images you already have (screenshots, dashboard frames every N seconds, rendered design iterations, time-lapse photos, receipt scans, medical slice exports), pipe them into a local hardware video encoder, and the output is a real playable video file that any vision-language model can reason about. No diffusion, no hallucination, no GPU pressure. The reason nobody covers it is that it does not need a 5090 card, so it does not feature in 'best local video AI' roundups. Fazm's screen observer is exactly this pipeline, productised on a Mac.

What is the anchor fact that makes this page uncopyable?

A one-line comment in the Fazm source tree. The file Desktop/.build/checkouts/macos-session-replay/Sources/SessionReplay/VideoChunkEncoder.swift at line 220 says: 'Use raw BGRA pixel input instead of PNG to avoid expensive per-frame PNG encoding'. That comment lives directly above the ffmpeg invocation on lines 226-242 which uses the arguments '-f rawvideo -pixel_format bgra -r 2 -i - -vcodec hevc_videotoolbox -tag:v hvc1 -q:v 65 -allow_sw true -realtime true -prio_speed true'. The public API that feeds it is at line 86: 'public func addFrame(image: CGImage, timestamp: Date) async throws -> EncodedFrame?'. That is the image-to-video function, in plain Swift, running locally on your Mac.

Why is raw BGRA faster than writing each frame as PNG?

PNG is a compressed, DEFLATE-based image format. Encoding a PNG on every captured frame adds a CPU-bound compression step before the frame even reaches the video encoder, and then the video encoder immediately has to decompress it back to pixels. Raw BGRA skips both steps. The Fazm pipeline draws the CGImage into a BGRA CGContext with interpolationQuality set to low (line 282), then writes the raw bytes to ffmpeg stdin. That data is already in the exact memory layout hevc_videotoolbox wants. The comment on line 220 is not a stylistic preference, it is the practical reason the pipeline can run at 2 FPS sustained for hours without pinning a core.

Which ffmpeg flag actually determines whether your laptop gets hot?

-vcodec hevc_videotoolbox. On Apple Silicon, that one flag routes encoding to the dedicated media engine on the M-series die, a block that exists next to the CPU and GPU with its own power budget. The alternative flags, -vcodec libx264 or -vcodec libx265, would take the job to the CPU and pay a sustained multi-watt cost for multi-hour sessions. For a continuously running image-sequence-to-video pipeline, that single substitution is the line between 'background observer you forget about' and 'laptop that runs hot after an hour'. libx265 software encoding of a 24 FPS 1080p stream can easily saturate two CPU cores; hevc_videotoolbox at 2 FPS barely registers in Activity Monitor.

Can Fazm generate synthetic video from a single image, like SVD or HunyuanVideo-I2V?

No. Fazm does not ship a diffusion model, does not embed a UNet, and does not run ComfyUI workflows. If your goal is to produce new pixels from a single still, you want the diffusion stack with an NVIDIA GPU. What Fazm does is the other half of the keyword: take a real stream of images you already have (most often a captured screen) and produce a local H.265 video file that the reasoning layer can watch. Two jobs, one keyword, and they do not substitute for each other.

Can I use the same pipeline for my own image directory, not just screenshots?

The Fazm consumer app wires the pipeline to screen capture because that is the product. But the underlying flow is general: CGImage in, HEVC chunk out, via ffmpeg stdin and hevc_videotoolbox. If you are a developer building a local image-to-video workflow on a Mac (time-lapse from a folder of JPEGs, frame-by-frame render review, animated chart exports), the same '-f rawvideo -pixel_format bgra -i -' pattern works. The productised version, where an AI reasoning layer sits on top of the chunks, is what Fazm packages for non-developers.

Why 2 FPS, not 24 or 30?

Because the downstream consumer is a vision-language model reasoning about UI state, not a human watching a cinematic playback. At 2 FPS the model still sees every meaningful transition (a click, a scroll, a new modal opening), the encoded chunks are roughly an order of magnitude smaller than a 24 FPS equivalent, and the capture loop never has to compete for CPU with whatever app you are recording. The SessionRecorder configuration literally sets framesPerSecond: 2.0 with the comment 'Lower FPS than research recorder to save CPU'. A higher frame rate is a tax you should not pay when the use case is 'AI watches the screen', yet most roundup-level tooling forces one because its defaults were written for human viewers.

Does the reasoning layer stay local too, or does it call a cloud model?

The capture pipeline is fully local: ScreenCaptureKit, a BGRA CGContext, ffmpeg with hevc_videotoolbox, and chunks written under Library/Caches/observer-recordings. The reasoning step is configurable. Out of the box, Fazm can route analysis through Gemini, which is a network call. If you want to stay fully offline, Fazm exposes a Custom API Endpoint field in Settings that sets ANTHROPIC_BASE_URL on the bridge subprocess, so you can point reasoning at an Anthropic-compatible proxy in front of a local model (Ollama, LM Studio, llama.cpp server). The image-to-video half does not care where reasoning runs; it just produces chunks.
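As a concrete sketch of that last hop, assuming a hypothetical Anthropic-compatible proxy listening on localhost:8080 in front of your local model (the port and proxy are illustrative, not Fazm defaults):

```shell
# Hypothetical: what the Custom API Endpoint field effectively does for the
# bridge subprocess. localhost:8080 stands in for your own Anthropic-compatible
# proxy (e.g. one fronting Ollama, LM Studio, or a llama.cpp server).
export ANTHROPIC_BASE_URL="http://localhost:8080"
```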