Your Agent Watches Video Wrong - Keyframe Extraction vs Frame-by-Frame
Sending every frame of a video to a vision model is like reading a book by analyzing every individual letter. It is technically thorough and practically insane. A 10-minute video at 30fps is 18,000 frames. Processing each one through a vision API costs a fortune and takes forever.
The fix is keyframe extraction: pull out the frames that actually matter and ignore the rest.
What Keyframes Capture
Most video content is redundant. A screen recording of someone using an app has maybe 20-30 meaningfully different states in a 10-minute session. Everything between those states is the cursor moving or a page loading - information you do not need.
Keyframe extraction identifies scene changes - moments where the visual content shifts significantly. These are the frames worth analyzing.
The Practical Pipeline
- Extract keyframes using FFmpeg's scene detection filter:

  ffmpeg -i video.mp4 -vf "select=gt(scene,0.3)" -vsync vfr frames/%04d.png

- Run OCR on each keyframe to extract text content - this captures UI labels, text on screen, code snippets, and any readable content
- Send keyframes to a vision model only when OCR is not sufficient - for understanding layouts, identifying UI elements, or reading handwritten content
- Stitch the narrative by ordering keyframes chronologically with their extracted text
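The steps above can be sketched in Python. The FFmpeg arguments mirror the command shown; the function names, and the choice of pytesseract as the OCR engine, are illustrative assumptions rather than part of any fixed API:

```python
import subprocess
from pathlib import Path

def ffmpeg_keyframe_cmd(video: str, out_dir: str, threshold: float = 0.3) -> list[str]:
    """Build the FFmpeg argv for scene-change keyframe extraction.

    select=gt(scene,T) keeps only frames whose scene-change score exceeds T;
    -vsync vfr re-times the output so the dropped frames leave no gaps.
    """
    return [
        "ffmpeg", "-i", video,
        "-vf", f"select=gt(scene,{threshold})",
        "-vsync", "vfr",
        str(Path(out_dir) / "%04d.png"),
    ]

def extract_and_ocr(video: str, out_dir: str = "frames") -> dict[str, str]:
    """Run keyframe extraction, then OCR each keyframe in order."""
    Path(out_dir).mkdir(exist_ok=True)
    subprocess.run(ffmpeg_keyframe_cmd(video, out_dir), check=True)
    # pytesseract is one OCR option (an assumption); any OCR engine fits here.
    import pytesseract
    from PIL import Image
    return {
        f.name: pytesseract.image_to_string(Image.open(f))
        for f in sorted(Path(out_dir).glob("*.png"))
    }
```

The returned dict, keyed by filename in chronological order, is already most of the "stitched narrative" from the last step.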
Why OCR First
OCR on keyframes handles 80% of video understanding tasks. If someone is demonstrating a workflow, the text on screen tells you what application they are using, what they typed, and what the results were. You do not need a vision model for that.
Reserve the expensive vision model calls for frames where OCR returns nothing useful - diagrams, photos, visual layouts without text.
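That routing decision can itself be a cheap heuristic. A minimal sketch - the function name and the 20-character threshold are assumptions, tune them to your content:

```python
def needs_vision_model(ocr_text: str, min_chars: int = 20) -> bool:
    """Escalate to a vision model only when OCR yields little readable
    text - typically diagrams, photos, or handwriting."""
    alnum = sum(ch.isalnum() for ch in ocr_text)
    return alnum < min_chars
```

A frame full of UI text stays on the cheap path; a blank or noisy OCR result triggers the expensive call.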
This approach turns a $50 video analysis job into a $2 one.
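The arithmetic behind that claim is easy to check. The per-call price below is a placeholder, not a quote from any real API's pricing page - the gap between the two totals, not the exact dollar figures, is the point:

```python
# Back-of-envelope cost comparison (illustrative numbers only).
FPS = 30
MINUTES = 10
PRICE_PER_FRAME = 0.003  # hypothetical cost of one vision-model call

total_frames = MINUTES * 60 * FPS  # 18,000 frames
keyframes = 25                     # typical for a screen recording (assumption)

frame_by_frame_cost = total_frames * PRICE_PER_FRAME  # ~$54
keyframe_cost = keyframes * PRICE_PER_FRAME           # under $0.10
```

Even with OCR and a handful of vision calls layered on top of the keyframe path, the total stays two orders of magnitude below the brute-force approach.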
Fazm is an open source macOS AI agent, available on GitHub.