Your Agent Watches Video Wrong - Keyframe Extraction vs Frame-by-Frame
Sending every frame of a video to a vision model is like reading a book by analyzing every individual letter. It is technically thorough and practically insane. A 10-minute video at 30fps is 18,000 frames. Processing each one through a vision API costs a fortune and takes forever.
The fix is keyframe extraction: pull out the frames that actually matter and ignore the rest.
What Keyframes Capture
Most video content is redundant. A screen recording of someone using an app has maybe 20-30 meaningfully different states in a 10-minute session. Everything between those states is the cursor moving or a page loading - information you do not need.
Keyframe extraction identifies scene changes - moments where the visual content shifts significantly. These are the frames worth analyzing.
The Practical Pipeline
- Extract keyframes using FFmpeg's scene detection filter:

  ffmpeg -i video.mp4 -vf "select=gt(scene,0.3)" -vsync vfr frames/%04d.png

- Run OCR on each keyframe to extract text content - this captures UI labels, text on screen, code snippets, and any readable content
- Send keyframes to a vision model only when OCR is not sufficient - for understanding layouts, identifying UI elements, or reading handwritten content
- Stitch the narrative by ordering keyframes chronologically with their extracted text
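The steps above can be sketched in Python. The FFmpeg arguments mirror the command shown; the function names, and the choice of pytesseract as the OCR engine, are illustrative assumptions rather than part of any fixed API:

```python
import subprocess
from pathlib import Path

def ffmpeg_keyframe_cmd(video: str, out_dir: str, threshold: float = 0.3) -> list[str]:
    """Build the FFmpeg argv for scene-change keyframe extraction.

    select=gt(scene,T) keeps only frames whose scene-change score exceeds T;
    -vsync vfr re-times the output so the dropped frames leave no gaps.
    """
    return [
        "ffmpeg", "-i", video,
        "-vf", f"select=gt(scene,{threshold})",
        "-vsync", "vfr",
        str(Path(out_dir) / "%04d.png"),
    ]

def extract_and_ocr(video: str, out_dir: str = "frames") -> dict[str, str]:
    """Run keyframe extraction, then OCR each keyframe in order."""
    Path(out_dir).mkdir(exist_ok=True)
    subprocess.run(ffmpeg_keyframe_cmd(video, out_dir), check=True)
    # pytesseract is one OCR option (an assumption); any OCR engine fits here.
    import pytesseract
    from PIL import Image
    return {
        f.name: pytesseract.image_to_string(Image.open(f))
        for f in sorted(Path(out_dir).glob("*.png"))
    }
```

The returned dict, keyed by filename in chronological order, is already most of the "stitched narrative" from the last step.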
Why OCR First
OCR on keyframes handles 80% of video understanding tasks. If someone is demonstrating a workflow, the text on screen tells you what application they are using, what they typed, and what the results were. You do not need a vision model for that.
Reserve the expensive vision model calls for frames where OCR returns nothing useful - diagrams, photos, visual layouts without text.
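That routing decision can itself be a cheap heuristic. A minimal sketch - the function name and the 20-character threshold are assumptions, tune them to your content:

```python
def needs_vision_model(ocr_text: str, min_chars: int = 20) -> bool:
    """Escalate to a vision model only when OCR yields little readable
    text - typically diagrams, photos, or handwriting."""
    alnum = sum(ch.isalnum() for ch in ocr_text)
    return alnum < min_chars
```

A frame full of UI text stays on the cheap path; a blank or noisy OCR result triggers the expensive call.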
This approach turns a $50 video analysis job into a $2 one.
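The arithmetic behind that claim is easy to check. The per-call price below is a placeholder, not a quote from any real API's pricing page - the gap between the two totals, not the exact dollar figures, is the point:

```python
# Back-of-envelope cost comparison (illustrative numbers only).
FPS = 30
MINUTES = 10
PRICE_PER_FRAME = 0.003  # hypothetical cost of one vision-model call

total_frames = MINUTES * 60 * FPS  # 18,000 frames
keyframes = 25                     # typical for a screen recording (assumption)

frame_by_frame_cost = total_frames * PRICE_PER_FRAME  # ~$54
keyframe_cost = keyframes * PRICE_PER_FRAME           # under $0.10
```

Even with OCR and a handful of vision calls layered on top of the keyframe path, the total stays two orders of magnitude below the brute-force approach.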
Fazm is an open source macOS AI agent, available on GitHub.