Your Agent Watches Video Wrong - Keyframe Extraction vs Frame-by-Frame

Fazm Team · 2 min read

Sending every frame of a video to a vision model is like reading a book by analyzing every individual letter. It is technically thorough and practically insane. A 10-minute video at 30fps is 18,000 frames. Processing each one through a vision API costs a fortune and takes forever.

The fix is keyframe extraction: pull out the frames that actually matter and ignore the rest.

What Keyframes Capture

Most video content is redundant. A screen recording of someone using an app has maybe 20-30 meaningfully different states in a 10-minute session. Everything between those states is the cursor moving or a page loading - information you do not need.

Keyframe extraction identifies scene changes - moments where the visual content shifts significantly. These are the frames worth analyzing.
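Under the hood, scene detection reduces to frame differencing: score how much each frame differs from the one before it, and keep frames that cross a threshold. A minimal sketch of that idea, representing frames as flat lists of grayscale pixel values (the 0.3 threshold mirrors the scene score used in the FFmpeg command in the pipeline; FFmpeg's actual scoring is more sophisticated):

```python
# Minimal sketch of scene-change detection by frame differencing.
# Frames are flat lists of grayscale pixel values in 0-255.

def frame_diff(a, b):
    """Mean absolute pixel difference between two frames, normalized to 0..1."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / (255 * len(a))

def keyframe_indices(frames, threshold=0.3):
    """Indices of frames that differ enough from the previous frame."""
    keep = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        if frame_diff(frames[i - 1], frames[i]) > threshold:
            keep.append(i)
    return keep

# Three near-identical frames, then an abrupt change: only indices 0 and 3 survive.
static = [10] * 16
changed = [200] * 16
print(keyframe_indices([static, static, static, changed]))  # [0, 3]
```

Consecutive frames of a screen recording score near zero on this metric, which is exactly why 18,000 frames collapse to a few dozen keyframes.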

The Practical Pipeline

  1. Extract keyframes using FFmpeg's scene detection filter: ffmpeg -i video.mp4 -vf "select=gt(scene,0.3)" -vsync vfr frames/%04d.png
  2. Run OCR on each keyframe to extract text content - this captures UI labels, text on screen, code snippets, and any readable content
  3. Send keyframes to a vision model only when OCR is not sufficient - for understanding layouts, identifying UI elements, or reading handwritten content
  4. Stitch the narrative by ordering keyframes chronologically with their extracted text
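Steps 2 through 4 can be sketched as a single loop over the extracted frames. Here `run_ocr` and `describe_frame` are hypothetical stand-ins for an OCR engine (e.g. Tesseract) and a vision-model API, injected as callables; the keyframes are assumed to already exist on disk from step 1:

```python
# Sketch of steps 2-4: OCR every keyframe, fall back to the vision model
# only when OCR yields too little text, then stitch results in order.
# run_ocr and describe_frame are assumed stand-ins, not real library calls.

def analyze_video(frame_paths, run_ocr, describe_frame, min_chars=20):
    """Build a chronological narrative from keyframe image paths."""
    narrative = []
    for i, path in enumerate(frame_paths):
        text = run_ocr(path)
        if len(text.strip()) < min_chars:
            # OCR found almost nothing; pay for the vision model instead
            text = describe_frame(path)
        narrative.append({"frame": i, "source": path, "text": text})
    return narrative
```

Injecting the OCR and vision calls keeps the orchestration testable and makes the cost trade-off explicit: `describe_frame` is only invoked when the cheap path fails.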

Why OCR First

OCR on keyframes handles 80% of video understanding tasks. If someone is demonstrating a workflow, the text on screen tells you what application they are using, what they typed, and what the results were. You do not need a vision model for that.

Reserve the expensive vision model calls for frames where OCR returns nothing useful - diagrams, photos, visual layouts without text.
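"Nothing useful" needs a concrete rule before you can route on it. One simple heuristic, with illustrative thresholds chosen for this sketch: require a minimum amount of text, and require most of it to be alphanumeric rather than OCR noise.

```python
# Heuristic for deciding whether OCR output is worth keeping.
# min_chars and min_alnum_ratio are illustrative assumptions, not tuned values.

def ocr_is_sufficient(text, min_chars=20, min_alnum_ratio=0.5):
    """True if OCR output looks like real text rather than noise."""
    stripped = "".join(text.split())
    if len(stripped) < min_chars:
        return False  # too little text: likely a diagram or photo
    alnum = sum(c.isalnum() for c in stripped)
    return alnum / len(stripped) >= min_alnum_ratio

print(ocr_is_sufficient("File Edit View npm run build exit code 0"))  # True
print(ocr_is_sufficient("|~. ,' \\"))  # False
```

Frames that fail this check are the ones worth a vision-model call.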

This approach turns a $50 video analysis job into a $2 one.

Fazm is an open-source macOS AI agent, available on GitHub.
