Automating Hundreds of Screenshots with Desktop Accessibility APIs
When you need to create hundreds of specific screenshots - different states, different screens, different configurations - doing it manually is not just tedious. It is error-prone. You miss states. You forget to resize windows. You accidentally capture your notification bar.
Screenshot automation needs to be state-aware, not position-aware. The macOS AXUIElement accessibility API is the foundation that makes this practical.
Why Pixel-Based Approaches Break at Scale
The naive approach is to script mouse movements and clicks, then capture screenshots at fixed intervals. This breaks constantly in practice:
- A dialog box appears one pixel off from its expected position
- The app takes 200ms longer to load than your hardcoded sleep
- A system notification pops up mid-capture
- The user has a different screen resolution or scaling factor
Fixed coordinates are a maintenance nightmare. You end up with a test suite that works on your machine on Tuesday and breaks on Wednesday after an OS update.
AXUIElement - The macOS Accessibility Tree
On macOS, AXUIElement gives you direct access to the UI tree of any application. Every visible element - buttons, text fields, windows, scroll views - is a node in this tree with queryable properties.
Instead of "click at position 340, 220" you say "click the button with label Submit in the main window." Instead of "wait 2 seconds" you say "wait until the element with role AXProgressIndicator is no longer in the tree."
Here is a Swift example that targets a button by its accessibility label:
```swift
import Cocoa

func findButton(inApp appName: String, labeled label: String) -> AXUIElement? {
    let workspace = NSWorkspace.shared
    guard let app = workspace.runningApplications.first(where: {
        $0.localizedName == appName
    }) else { return nil }

    let axApp = AXUIElementCreateApplication(app.processIdentifier)
    var windows: CFTypeRef?
    AXUIElementCopyAttributeValue(axApp, kAXWindowsAttribute as CFString, &windows)
    guard let windowList = windows as? [AXUIElement] else { return nil }

    for window in windowList {
        if let button = findElement(in: window, role: kAXButtonRole, label: label) {
            return button
        }
    }
    return nil
}
```
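The `findElement` helper used above is not part of the AX API itself; it is a tree walk you write yourself. A minimal recursive sketch (matching on `kAXRoleAttribute` and `kAXTitleAttribute`, recursing through `kAXChildrenAttribute`) might look like this:

```swift
// Minimal sketch of a recursive accessibility-tree search.
// Passing nil for role or label means "match anything" for that field.
func findElement(in element: AXUIElement, role: String?, label: String?) -> AXUIElement? {
    // Read this element's role and title
    var roleValue: CFTypeRef?
    AXUIElementCopyAttributeValue(element, kAXRoleAttribute as CFString, &roleValue)
    var titleValue: CFTypeRef?
    AXUIElementCopyAttributeValue(element, kAXTitleAttribute as CFString, &titleValue)

    let roleMatches = role == nil || (roleValue as? String) == role
    let labelMatches = label == nil || (titleValue as? String) == label
    if roleMatches && labelMatches { return element }

    // Otherwise recurse into the element's children
    var children: CFTypeRef?
    AXUIElementCopyAttributeValue(element, kAXChildrenAttribute as CFString, &children)
    guard let childList = children as? [AXUIElement] else { return nil }
    for child in childList {
        if let match = findElement(in: child, role: role, label: label) {
            return match
        }
    }
    return nil
}
```

Real apps may expose labels through `kAXDescriptionAttribute` or `kAXIdentifierAttribute` rather than the title, so a production version would check more than one attribute.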
```swift
func clickAndCapture(button: AXUIElement, outputPath: String) {
    // Perform the click
    AXUIElementPerformAction(button, kAXPressAction as CFString)
    // Wait for stable state - poll until no loading indicator visible
    waitForStableState(timeout: 3.0)
    // Capture with ScreenCaptureKit
    captureScreen(to: outputPath)
}
```
The waitForStableState function is the critical piece. It polls the accessibility tree looking for loading indicators, progress bars, or any element you define as "in-transition." Only when those are gone do you capture.
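One possible implementation, here parameterized by the window to watch (an assumption; the signature earlier takes only a timeout), polls for progress indicators and returns as soon as none remain:

```swift
import Foundation

// Sketch: poll the window's accessibility tree until no progress indicator
// is present, or until the timeout elapses. Assumes a findElement helper
// that searches the tree by role and label.
func waitForStableState(in window: AXUIElement, timeout: TimeInterval) {
    let deadline = Date().addingTimeInterval(timeout)
    while Date() < deadline {
        let busy = findElement(in: window, role: kAXProgressIndicatorRole, label: nil) != nil
        if !busy { return }
        Thread.sleep(forTimeInterval: 0.05)  // poll every 50ms
    }
    // Timed out: the caller can capture anyway or record a failure
}
```

Polling the tree is cheap (queries return in milliseconds), so a 50ms interval is a reasonable balance between latency and load.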
A Real Screenshot Manifest
Define your screenshots as data, not imperative code. A manifest file describes each capture:
```json
[
  {
    "id": "onboarding-step-1",
    "app": "MyApp",
    "navigate": ["MainWindow", "SettingsButton"],
    "waitFor": {"role": "AXWindow", "title": "Settings"},
    "state": {"toggle": "DarkModeSwitch", "value": false},
    "output": "screenshots/settings-light.png"
  },
  {
    "id": "onboarding-step-2",
    "app": "MyApp",
    "navigate": ["MainWindow", "SettingsButton"],
    "waitFor": {"role": "AXWindow", "title": "Settings"},
    "state": {"toggle": "DarkModeSwitch", "value": true},
    "output": "screenshots/settings-dark.png"
  }
]
```
The automation engine reads this manifest, executes each entry, and reports failures without stopping the entire batch. A failure on screenshot 47 does not corrupt screenshots 48 through 300.
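A sketch of that engine's skeleton, using `Codable` to model the manifest fields shown above (the struct names and the per-entry `execute` function are hypothetical, not a fixed schema):

```swift
import Foundation

// Codable model mirroring the manifest JSON fields
struct WaitCondition: Codable {
    let role: String
    let title: String
}

struct StateChange: Codable {
    let toggle: String
    let value: Bool
}

struct ScreenshotEntry: Codable {
    let id: String
    let app: String
    let navigate: [String]
    let waitFor: WaitCondition
    let state: StateChange
    let output: String
}

// Run every entry, collecting failures instead of aborting the batch
func runManifest(at path: String) throws -> [String] {
    let data = try Data(contentsOf: URL(fileURLWithPath: path))
    let entries = try JSONDecoder().decode([ScreenshotEntry].self, from: data)

    var failures: [String] = []
    for entry in entries {
        do {
            try execute(entry)  // hypothetical: navigate, set state, wait, capture
        } catch {
            failures.append("\(entry.id): \(error)")
        }
    }
    return failures
}
```

The catch-and-continue loop is what guarantees that a failure on screenshot 47 leaves the rest of the batch untouched.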
Performance: What to Expect
Research by MacPaw's Screen2AX team (2025) found that only about 33% of macOS applications expose complete accessibility trees. For the remaining 67%, you need fallback strategies - either vision-based element detection or hybrid approaches.
For well-supported apps, AXUIElement queries return in under 10ms. A full screenshot batch of 200 images with state navigation between each one typically completes in 15 to 30 minutes, depending on app load times.
Using ScreenCaptureKit instead of the older CGDisplayCreateImage reduces per-screenshot overhead from roughly 80ms to under 20ms on M-series hardware. For large batches, that compounds.
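A minimal capture sketch with ScreenCaptureKit, assuming macOS 14 or later (where `SCScreenshotManager` is available) and that screen-recording permission has already been granted:

```swift
import ScreenCaptureKit
import ImageIO
import UniformTypeIdentifiers

// Sketch: capture the main display and write it to a PNG file.
func captureScreen(to outputPath: String) async throws {
    // Enumerate shareable content and pick the first display
    let content = try await SCShareableContent.excludingDesktopWindows(
        false, onScreenWindowsOnly: true)
    guard let display = content.displays.first else { return }

    let filter = SCContentFilter(display: display, excludingWindows: [])
    let config = SCStreamConfiguration()
    config.width = display.width
    config.height = display.height

    // One-shot screenshot (macOS 14+)
    let image = try await SCScreenshotManager.captureImage(
        contentFilter: filter, configuration: config)

    // Encode the CGImage as PNG
    let url = URL(fileURLWithPath: outputPath) as CFURL
    guard let dest = CGImageDestinationCreateWithURL(
        url, UTType.png.identifier as CFString, 1, nil) else { return }
    CGImageDestinationAddImage(dest, image, nil)
    CGImageDestinationFinalize(dest)
}
```

On earlier macOS versions you would instead set up an `SCStream` and grab a single frame, which is more code for the same result.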
Handling the Other 67%
When an app has incomplete accessibility support, you have two options:
Option 1 - Hybrid fallback. Use vision-based element detection (a model like Florence-2 or Apple's own Vision framework) to locate elements by visual appearance when the accessibility tree returns nothing useful. Slower, but covers the gap.
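As one concrete shape of that fallback, Apple's Vision framework can locate a labeled control by recognizing its text in a screenshot. This helper is hypothetical (the name, threshold-free matching, and coordinate handling are all assumptions), but the Vision calls are the standard text-recognition API:

```swift
import Vision
import CoreGraphics

// Sketch: find the pixel bounding box of a visible label (e.g. "Submit")
// in a captured CGImage when the accessibility tree is incomplete.
func locateLabel(_ label: String, in image: CGImage) -> CGRect? {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try? handler.perform([request])
    guard let observations = request.results else { return nil }

    for observation in observations {
        guard let candidate = observation.topCandidates(1).first else { continue }
        if candidate.string == label {
            // boundingBox is normalized with a bottom-left origin;
            // convert to top-left pixel coordinates
            let box = observation.boundingBox
            return CGRect(
                x: box.origin.x * CGFloat(image.width),
                y: (1 - box.origin.y - box.height) * CGFloat(image.height),
                width: box.width * CGFloat(image.width),
                height: box.height * CGFloat(image.height))
        }
    }
    return nil
}
```

The returned rect can then drive a synthetic click via `CGEvent`, at the cost of the brittleness that accessibility queries were meant to avoid, which is why this stays a fallback.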
Option 2 - Inject accessibility. For apps you control, add accessibilityIdentifier to every UI element during development. This is the permanent fix and costs almost nothing to maintain.
```swift
// In your app's view setup
submitButton.setAccessibilityIdentifier("submit-button")
submitButton.setAccessibilityLabel("Submit")
```
The Verification Step
Before every capture, verify that the UI is in the exact state you expect. A minimal verification checks:
- The target window is frontmost
- No modal dialogs are blocking
- All expected elements are visible (not hidden, not loading)
- Optional: the window frame matches expected dimensions
Skip the verification and you end up with screenshots that look correct but capture subtly wrong states - a spinner that has only just disappeared, a dropdown that had not closed yet.
```swift
func verifyReadyToCapture(window: AXUIElement, expectedElements: [String]) -> Bool {
    for elementLabel in expectedElements {
        guard findElement(in: window, role: nil, label: elementLabel) != nil else {
            print("Verification failed: element '\(elementLabel)' not found")
            return false
        }
    }
    // Check no progress indicators are visible
    if findElement(in: window, role: kAXProgressIndicatorRole, label: nil) != nil {
        return false
    }
    return true
}
```
Scaling to Hundreds
Once the pattern is in place, scaling from 10 screenshots to 500 is an engineering problem, not a design problem:
- Run captures in parallel across multiple app instances when the app supports it
- Use a coordinator process that manages app lifecycle and retries failed captures
- Store the manifest in version control so your screenshot set evolves with the product
- Generate diff images to catch unintended visual regressions between runs
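For the regression check in the last bullet, even a naive comparison catches most unintended changes. A sketch (assuming both captures were decoded to the same size and pixel format):

```swift
import CoreGraphics

// Naive pixel-level comparison of two captures. A real diff tool would
// tolerate small antialiasing differences and emit a visual diff image.
func imagesDiffer(_ a: CGImage, _ b: CGImage) -> Bool {
    guard a.width == b.width, a.height == b.height else { return true }
    guard let dataA = a.dataProvider?.data as Data?,
          let dataB = b.dataProvider?.data as Data? else { return true }
    return dataA != dataB
}
```

Exact byte equality is strict; in practice a small per-pixel tolerance avoids flagging font-smoothing noise as a regression.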
The key shift is treating screenshot automation like a test suite - deterministic, reproducible, diffable, and integrated into CI.
Fazm is an open source macOS AI agent that uses AXUIElement extensively for desktop automation; the source is available on GitHub.