Real-Time AI Agent Performance - Fixing the Screenshot Pipeline
You built an AI agent that watches your screen and takes actions. It works, but it's slow. Before you blame the LLM, check your screenshot capture pipeline. Nine times out of ten, that's where the time goes.
Profiling the Pipeline
Break down where milliseconds are spent in a typical capture-analyze-act cycle:
| Step | Typical Time |
|------|--------------|
| Screen capture | 50-150ms |
| Image encoding | 50-200ms |
| Network transfer | 100-500ms |
| LLM inference | 500-2000ms |
| Action execution | 10-50ms |
In isolation, the LLM looks like the bottleneck. But capture and encoding happen every cycle, and they compound: at 2 frames per second, you're spending 200-700ms of every second just on capture and encoding overhead.
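The arithmetic behind that overhead figure, using the per-frame ranges from the table:

```swift
// Per-second overhead at 2 fps, from the table's per-frame ranges
// for capture (50-150ms) and encoding (50-200ms).
let fps = 2.0
let perFrameMin = 50.0 + 50.0    // ms: capture + encode, best case
let perFrameMax = 150.0 + 200.0  // ms: capture + encode, worst case
let overheadMin = fps * perFrameMin  // 200 ms of every second
let overheadMax = fps * perFrameMax  // 700 ms of every second
```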
Practical Optimizations
Reduce capture resolution
A 5K Retina display captures frames at 5120x2880. Most vision models process at 1024x1024 or lower. Capture at 1280x720 from the start using ScreenCaptureKit's built-in scaling. This alone cuts encoding time by 80%.
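Under ScreenCaptureKit this is a few lines of stream configuration. A minimal sketch; the 2 fps cap is an assumption to match the cycle rate discussed here:

```swift
import ScreenCaptureKit
import CoreMedia

// Sketch: ask ScreenCaptureKit to scale at capture time, so full-resolution
// frames never enter the pipeline.
let config = SCStreamConfiguration()
config.width = 1280
config.height = 720
config.scalesToFit = true  // preserve aspect ratio within 1280x720
config.minimumFrameInterval = CMTime(value: 1, timescale: 2)  // cap at 2 fps
config.pixelFormat = kCVPixelFormatType_32BGRA
```

Pass this configuration to your `SCStream` when you start capture.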
Skip unchanged frames
Compare frame hashes before encoding. If the screen hasn't changed since the last capture, skip the entire pipeline. Desktop screens are static most of the time - during file downloads, long compilations, or when waiting for user input.
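A minimal sketch of the skip logic, using Swift's built-in `Hasher` as a stand-in for whatever frame hash you choose. A real capture loop would hash the raw pixel buffer bytes rather than a `Data` value:

```swift
import Foundation

// Sketch: skip the pipeline when the frame bytes haven't changed.
// `Data` stands in for the raw pixel buffer contents.
struct FrameDeduper {
    private var lastHash: Int?

    // Returns true only when the frame differs from the previous one.
    mutating func shouldProcess(_ frame: Data) -> Bool {
        var hasher = Hasher()
        hasher.combine(frame)
        let hash = hasher.finalize()
        defer { lastHash = hash }
        return hash != lastHash
    }
}
```

In the capture loop, call `shouldProcess` before encoding; on `false`, drop the frame and sleep until the next tick.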
Capture regions, not full screen
If your agent is interacting with a specific window, capture only that window. ScreenCaptureKit supports window-level and region-level capture. An 800x600 window capture is roughly 30x fewer pixels than a full 5K screen.
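With ScreenCaptureKit, window-level capture means building an `SCContentFilter` from the target `SCWindow`. A sketch, run inside an async context; the bundle identifier is a hypothetical placeholder for your target app:

```swift
import ScreenCaptureKit

// Sketch: capture a single window instead of the whole display.
let content = try await SCShareableContent.excludingDesktopWindows(
    false, onScreenWindowsOnly: true)

if let window = content.windows.first(where: {
    $0.owningApplication?.bundleIdentifier == "com.example.target"  // hypothetical
}) {
    let filter = SCContentFilter(desktopIndependentWindow: window)
    // Pass `filter` (plus an SCStreamConfiguration) to SCStream
    // or SCScreenshotManager.
}
```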
Use hardware encoding
Apple Silicon has dedicated hardware for JPEG and HEIF encoding. Use VideoToolbox instead of software encoding. The difference is dramatic - hardware JPEG encoding runs in single-digit milliseconds.
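One way to get there is a `VTCompressionSession` with the JPEG codec, explicitly requiring the hardware encoder so creation fails loudly instead of silently falling back to software. A sketch, assuming 1280x720 frames:

```swift
import VideoToolbox

// Sketch: a hardware-accelerated JPEG compression session.
var session: VTCompressionSession?
let spec: [CFString: Any] = [
    kVTVideoEncoderSpecification_RequireHardwareAcceleratedVideoEncoder: true
]
let status = VTCompressionSessionCreate(
    allocator: kCFAllocatorDefault,
    width: 1280,
    height: 720,
    codecType: kCMVideoCodecType_JPEG,
    encoderSpecification: spec as CFDictionary,
    imageBufferAttributes: nil,
    compressedDataAllocator: nil,
    outputCallback: nil,
    refcon: nil,
    compressionSessionOut: &session
)
// status == noErr on success; then feed each captured CVPixelBuffer
// to VTCompressionSessionEncodeFrame(...) every cycle.
```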
The Bigger Win
The biggest performance improvement comes from not capturing screenshots at all when you don't need them. Use the accessibility API for structured data - reading text, identifying buttons, checking element states. Reserve screenshots for truly visual tasks like verifying layout or reading images.
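For example, reading the focused element's title through the accessibility API moves no pixels at all. A sketch, assuming your process has been granted accessibility permission and `pid` is the target app's process id:

```swift
import ApplicationServices

// Sketch: read the focused UI element's title via the accessibility API
// instead of taking a screenshot.
func focusedElementTitle(pid: pid_t) -> String? {
    let app = AXUIElementCreateApplication(pid)

    var focused: CFTypeRef?
    guard AXUIElementCopyAttributeValue(
        app, kAXFocusedUIElementAttribute as CFString, &focused) == .success,
        let element = focused
    else { return nil }

    var title: CFTypeRef?
    guard AXUIElementCopyAttributeValue(
        element as! AXUIElement, kAXTitleAttribute as CFString, &title) == .success
    else { return nil }

    return title as? String
}
```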
Agents that combine accessibility data with occasional screenshots are consistently faster than pure screenshot-based agents.
Fazm is an open-source macOS AI agent, available on GitHub.