Scaling Real-Time AI - Why the Screenshot Capture Pipeline Is Always the Bottleneck

Fazm Team · 3 min read

Scaling real-time AI is tricky. You'd expect the LLM inference to be the slowest part. Or maybe the action execution - clicking, typing, navigating. But the biggest bottleneck is almost always the screenshot capture pipeline.

The Capture Problem

A real-time AI agent needs to see what's on screen. That means capturing frames, encoding them, and sending them to a model for analysis. On macOS, you're using ScreenCaptureKit or the older CGWindowListCreateImage API, which is deprecated in favor of ScreenCaptureKit. Each capture involves:

  1. Requesting a frame from the system
  2. Converting the raw pixel buffer to an image format
  3. Compressing it (usually JPEG or PNG)
  4. Sending it to the LLM or vision model
  5. Waiting for the response

At 1-2 frames per second, this pipeline dominates your agent's response time. The capture itself takes 50-100ms. Encoding takes another 50-200ms depending on resolution and format. Network transfer for cloud models adds latency on top.
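To see how quickly those stages eat the frame budget, here is a rough sketch that sums the ranges above. The network figure is an assumption for a cloud round trip, not a number from any measurement:

```python
# Rough per-frame latency budget for a 1-2 FPS capture loop.
# CAPTURE_MS and ENCODE_MS are the illustrative ranges from the text;
# NETWORK_MS is an assumed cloud round-trip overhead.

CAPTURE_MS = (50, 100)   # requesting + receiving a frame
ENCODE_MS = (50, 200)    # pixel buffer -> JPEG/PNG
NETWORK_MS = (50, 300)   # assumed upload/response overhead for a cloud model

def budget(stages):
    """Sum (min, max) latency ranges across pipeline stages."""
    lo = sum(s[0] for s in stages)
    hi = sum(s[1] for s in stages)
    return lo, hi

lo, hi = budget([CAPTURE_MS, ENCODE_MS, NETWORK_MS])
print(f"capture pipeline alone: {lo}-{hi} ms per frame")
```

At 2 FPS you only have 500 ms per frame, so under these assumptions the pipeline can consume most of the budget before the model even starts thinking.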

Why Resolution Kills You

A 5K Retina display produces enormous frames. A full-screen capture at native resolution (5120 x 2880) is 14.7 million pixels. Even compressed to JPEG, that's a multi-megabyte payload. Most vision models don't need that resolution - they work fine at 1080p or even 720p - but developers often capture at full resolution by default.
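The back-of-the-envelope math looks like this. The 0.5 bytes-per-pixel JPEG figure is a rough assumption for typical quality, not a spec:

```python
# Estimated JPEG payload for a full 5K frame (5120 x 2880) versus
# a 1080p downscale. 0.5 bytes/pixel is a rough typical-quality
# JPEG assumption; actual sizes vary with content and quality setting.

BYTES_PER_PIXEL_JPEG = 0.5

def jpeg_estimate_mb(width, height):
    """Rough JPEG size estimate in megabytes."""
    return width * height * BYTES_PER_PIXEL_JPEG / 1_000_000

native = jpeg_estimate_mb(5120, 2880)
downscaled = jpeg_estimate_mb(1920, 1080)
print(f"native 5K: ~{native:.1f} MB, 1080p: ~{downscaled:.1f} MB")
# native 5K: ~7.4 MB, 1080p: ~1.0 MB
```

Roughly a 7x reduction in payload, for a resolution most vision models handle just as well.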

Fix: capture at a reduced resolution from the start. ScreenCaptureKit lets you set the output size on the stream configuration, so the system downscales during capture rather than handing you a full frame to resize afterward. This saves both encoding time and bandwidth.
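The sizing math is simple either way. In Swift you would assign the result to SCStreamConfiguration's width and height properties; the sketch below just shows the aspect-preserving calculation:

```python
# Compute a downscaled capture size that preserves aspect ratio.
# The even-width rounding is a common encoder requirement, included
# here as a defensive assumption.

def scaled_size(src_w, src_h, max_h):
    """Fit the frame to at most max_h pixels tall, preserving aspect
    ratio and rounding the width down to an even number."""
    if src_h <= max_h:
        return src_w, src_h
    scale = max_h / src_h
    w = int(src_w * scale)
    return w - (w % 2), max_h

print(scaled_size(5120, 2880, 1080))  # (1920, 1080)
```

Feeding the 5K frame through this gives an exact 1080p target, since both share the 16:9 aspect ratio.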

Smarter Capture Strategies

  • Region-based capture - only capture the area that matters, not the full screen
  • Diff detection - skip frames where nothing changed since the last capture
  • Hybrid approach - use accessibility API data for most interactions and only capture screenshots when visual verification is needed
  • Frame pooling - reuse buffers instead of allocating new memory each frame
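Diff detection is the easiest of these to sketch. The version below hashes each frame's raw bytes and skips exact repeats; real implementations often hash a downsampled copy or compare tiles so tiny changes (a blinking cursor) don't defeat the skip, but the exact-match form shows the idea:

```python
import hashlib

# Minimal diff-detection sketch: hash each frame's raw bytes and
# report whether it differs from the previous frame. Unchanged frames
# can skip encoding and upload entirely.

class FrameDiffer:
    def __init__(self):
        self._last = None

    def changed(self, frame_bytes: bytes) -> bool:
        digest = hashlib.blake2b(frame_bytes, digest_size=16).digest()
        if digest == self._last:
            return False
        self._last = digest
        return True

differ = FrameDiffer()
print(differ.changed(b"frame-1"))  # True: first frame is always new
print(differ.changed(b"frame-1"))  # False: identical, skip this frame
print(differ.changed(b"frame-2"))  # True: screen changed
```

Storing a 16-byte digest instead of the previous frame itself keeps the memory cost negligible even for 5K frames.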

The Real Solution

The best-performing agents take as few screenshots as possible. Use the accessibility tree for structured UI data. Reserve screenshots for genuinely visual tasks - identifying images, reading non-standard UI elements, or verifying layout. This hybrid approach is 5-10x faster than a screenshot-only agent.
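The routing decision itself can be a simple classifier over task types. This is a purely hypothetical sketch - the task names and the VISUAL_TASKS set are illustrative, not part of any real API:

```python
# Hypothetical routing for a hybrid agent: serve most tasks from the
# accessibility tree and fall back to a screenshot only for tasks that
# are inherently visual. Task names here are invented for illustration.

VISUAL_TASKS = {"identify_image", "verify_layout", "read_canvas"}

def choose_input(task: str) -> str:
    """Return which input source a given task should use."""
    return "screenshot" if task in VISUAL_TASKS else "accessibility_tree"

print(choose_input("click_button"))   # accessibility_tree
print(choose_input("verify_layout"))  # screenshot
```

Since the accessibility path involves no capture, encode, or upload, every task routed away from the screenshot pipeline skips the entire latency budget discussed above.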

The capture pipeline will always be a bottleneck if you treat it as the primary input. Treat it as a fallback instead.

Fazm is an open-source macOS AI agent, available on GitHub.
