Automating Hundreds of Screenshots with Desktop Accessibility APIs
When you need to create hundreds of specific screenshots - different states, different screens, different configurations - doing it manually is not just tedious. It is error-prone. You miss states. You forget to resize windows. You accidentally capture your notification bar.
Screenshot automation needs to be state-aware, not position-aware. The macOS AXUIElement accessibility API is the foundation that makes this practical.
Why Pixel-Based Approaches Break at Scale
The naive approach is to script mouse movements and clicks, then capture screenshots at fixed intervals. This breaks constantly in practice:
- A dialog box appears one pixel off from its expected position
- The app takes 200ms longer to load than your hardcoded sleep
- A system notification pops up mid-capture
- The user has a different screen resolution or scaling factor
Fixed coordinates are a maintenance nightmare. You end up with a test suite that works on your machine on Tuesday and breaks on Wednesday after an OS update.
AXUIElement - The macOS Accessibility Tree
On macOS, AXUIElement gives you direct access to the UI tree of any application. Every visible element - buttons, text fields, windows, scroll views - is a node in this tree with queryable properties.
Instead of "click at position 340, 220" you say "click the button with label Submit in the main window." Instead of "wait 2 seconds" you say "wait until the element with role AXProgressIndicator is no longer in the tree."
Here is a Swift example that targets a button by its accessibility label:
```swift
import Cocoa

func findButton(inApp appName: String, labeled label: String) -> AXUIElement? {
    let workspace = NSWorkspace.shared
    guard let app = workspace.runningApplications.first(where: {
        $0.localizedName == appName
    }) else { return nil }

    let axApp = AXUIElementCreateApplication(app.processIdentifier)
    var windows: CFTypeRef?
    AXUIElementCopyAttributeValue(axApp, kAXWindowsAttribute as CFString, &windows)
    guard let windowList = windows as? [AXUIElement] else { return nil }

    for window in windowList {
        if let button = findElement(in: window, role: kAXButtonRole, label: label) {
            return button
        }
    }
    return nil
}
```
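The `findElement` helper used above is not part of the AX API itself; it is a tree walk you write yourself. A minimal recursive sketch (matching on `kAXRoleAttribute` and `kAXTitleAttribute`, recursing through `kAXChildrenAttribute`) might look like this:

```swift
// Minimal sketch of a recursive accessibility-tree search.
// Passing nil for role or label means "match anything" for that field.
func findElement(in element: AXUIElement, role: String?, label: String?) -> AXUIElement? {
    // Read this element's role and title
    var roleValue: CFTypeRef?
    AXUIElementCopyAttributeValue(element, kAXRoleAttribute as CFString, &roleValue)
    var titleValue: CFTypeRef?
    AXUIElementCopyAttributeValue(element, kAXTitleAttribute as CFString, &titleValue)

    let roleMatches = role == nil || (roleValue as? String) == role
    let labelMatches = label == nil || (titleValue as? String) == label
    if roleMatches && labelMatches { return element }

    // Otherwise recurse into the element's children
    var children: CFTypeRef?
    AXUIElementCopyAttributeValue(element, kAXChildrenAttribute as CFString, &children)
    guard let childList = children as? [AXUIElement] else { return nil }
    for child in childList {
        if let match = findElement(in: child, role: role, label: label) {
            return match
        }
    }
    return nil
}
```

Real apps may expose labels through `kAXDescriptionAttribute` or `kAXIdentifierAttribute` rather than the title, so a production version would check more than one attribute.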
```swift
func clickAndCapture(button: AXUIElement, outputPath: String) {
    // Perform the click
    AXUIElementPerformAction(button, kAXPressAction as CFString)
    // Wait for stable state - poll until no loading indicator visible
    waitForStableState(timeout: 3.0)
    // Capture with ScreenCaptureKit
    captureScreen(to: outputPath)
}
```
The waitForStableState function is the critical piece. It polls the accessibility tree looking for loading indicators, progress bars, or any element you define as "in-transition." Only when those are gone do you capture.
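One possible implementation, here parameterized by the window to watch (an assumption; the signature earlier takes only a timeout), polls for progress indicators and returns as soon as none remain:

```swift
import Foundation

// Sketch: poll the window's accessibility tree until no progress indicator
// is present, or until the timeout elapses. Assumes a findElement helper
// that searches the tree by role and label.
func waitForStableState(in window: AXUIElement, timeout: TimeInterval) {
    let deadline = Date().addingTimeInterval(timeout)
    while Date() < deadline {
        let busy = findElement(in: window, role: kAXProgressIndicatorRole, label: nil) != nil
        if !busy { return }
        Thread.sleep(forTimeInterval: 0.05)  // poll every 50ms
    }
    // Timed out: the caller can capture anyway or record a failure
}
```

Polling the tree is cheap (queries return in milliseconds), so a 50ms interval is a reasonable balance between latency and load.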
A Real Screenshot Manifest
Define your screenshots as data, not imperative code. A manifest file describes each capture:
```json
[
  {
    "id": "onboarding-step-1",
    "app": "MyApp",
    "navigate": ["MainWindow", "SettingsButton"],
    "waitFor": {"role": "AXWindow", "title": "Settings"},
    "state": {"toggle": "DarkModeSwitch", "value": false},
    "output": "screenshots/settings-light.png"
  },
  {
    "id": "onboarding-step-2",
    "app": "MyApp",
    "navigate": ["MainWindow", "SettingsButton"],
    "waitFor": {"role": "AXWindow", "title": "Settings"},
    "state": {"toggle": "DarkModeSwitch", "value": true},
    "output": "screenshots/settings-dark.png"
  }
]
```
The automation engine reads this manifest, executes each entry, and reports failures without stopping the entire batch. A failure on screenshot 47 does not corrupt screenshots 48 through 300.
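A sketch of that engine's skeleton, using `Codable` to model the manifest fields shown above (the struct names and the per-entry `execute` function are hypothetical, not a fixed schema):

```swift
import Foundation

// Codable model mirroring the manifest JSON fields
struct WaitCondition: Codable {
    let role: String
    let title: String
}

struct StateChange: Codable {
    let toggle: String
    let value: Bool
}

struct ScreenshotEntry: Codable {
    let id: String
    let app: String
    let navigate: [String]
    let waitFor: WaitCondition
    let state: StateChange
    let output: String
}

// Run every entry, collecting failures instead of aborting the batch
func runManifest(at path: String) throws -> [String] {
    let data = try Data(contentsOf: URL(fileURLWithPath: path))
    let entries = try JSONDecoder().decode([ScreenshotEntry].self, from: data)

    var failures: [String] = []
    for entry in entries {
        do {
            try execute(entry)  // hypothetical: navigate, set state, wait, capture
        } catch {
            failures.append("\(entry.id): \(error)")
        }
    }
    return failures
}
```

The catch-and-continue loop is what guarantees that a failure on screenshot 47 leaves the rest of the batch untouched.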
Performance: What to Expect
Research by MacPaw's Screen2AX team (2025) found that only about 33% of macOS applications expose complete accessibility trees. For the remaining 67%, you need fallback strategies - either vision-based element detection or hybrid approaches.
For well-supported apps, AXUIElement queries return in under 10ms. A full screenshot batch of 200 images with state navigation between each one typically completes in 15 to 30 minutes, depending on app load times.
Using ScreenCaptureKit instead of the older CGDisplayCreateImage reduces per-screenshot overhead from roughly 80ms to under 20ms on M-series hardware. For large batches, that compounds.
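A minimal capture sketch with ScreenCaptureKit, assuming macOS 14 or later (where `SCScreenshotManager` is available) and that screen-recording permission has already been granted:

```swift
import ScreenCaptureKit
import ImageIO
import UniformTypeIdentifiers

// Sketch: capture the main display and write it to a PNG file.
func captureScreen(to outputPath: String) async throws {
    // Enumerate shareable content and pick the first display
    let content = try await SCShareableContent.excludingDesktopWindows(
        false, onScreenWindowsOnly: true)
    guard let display = content.displays.first else { return }

    let filter = SCContentFilter(display: display, excludingWindows: [])
    let config = SCStreamConfiguration()
    config.width = display.width
    config.height = display.height

    // One-shot screenshot (macOS 14+)
    let image = try await SCScreenshotManager.captureImage(
        contentFilter: filter, configuration: config)

    // Encode the CGImage as PNG
    let url = URL(fileURLWithPath: outputPath) as CFURL
    guard let dest = CGImageDestinationCreateWithURL(
        url, UTType.png.identifier as CFString, 1, nil) else { return }
    CGImageDestinationAddImage(dest, image, nil)
    CGImageDestinationFinalize(dest)
}
```

On earlier macOS versions you would instead set up an `SCStream` and grab a single frame, which is more code for the same result.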
Handling the Other 67%
When an app has incomplete accessibility support, you have two options:
Option 1 - Hybrid fallback. Use vision-based element detection (a model like Florence-2 or Apple's own Vision framework) to locate elements by visual appearance when the accessibility tree returns nothing useful. Slower, but covers the gap.
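As one concrete shape of that fallback, Apple's Vision framework can locate a labeled control by recognizing its text in a screenshot. This helper is hypothetical (the name, threshold-free matching, and coordinate handling are all assumptions), but the Vision calls are the standard text-recognition API:

```swift
import Vision
import CoreGraphics

// Sketch: find the pixel bounding box of a visible label (e.g. "Submit")
// in a captured CGImage when the accessibility tree is incomplete.
func locateLabel(_ label: String, in image: CGImage) -> CGRect? {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try? handler.perform([request])
    guard let observations = request.results else { return nil }

    for observation in observations {
        guard let candidate = observation.topCandidates(1).first else { continue }
        if candidate.string == label {
            // boundingBox is normalized with a bottom-left origin;
            // convert to top-left pixel coordinates
            let box = observation.boundingBox
            return CGRect(
                x: box.origin.x * CGFloat(image.width),
                y: (1 - box.origin.y - box.height) * CGFloat(image.height),
                width: box.width * CGFloat(image.width),
                height: box.height * CGFloat(image.height))
        }
    }
    return nil
}
```

The returned rect can then drive a synthetic click via `CGEvent`, at the cost of the brittleness that accessibility queries were meant to avoid, which is why this stays a fallback.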
Option 2 - Inject accessibility. For apps you control, add accessibilityIdentifier to every UI element during development. This is the permanent fix and costs almost nothing to maintain.
```swift
// In your app's view setup
submitButton.setAccessibilityIdentifier("submit-button")
submitButton.setAccessibilityLabel("Submit")
```
The Verification Step
Before every capture, verify that the UI is in the exact state you expect. A minimal verification checks:
- The target window is frontmost
- No modal dialogs are blocking
- All expected elements are visible (not hidden, not loading)
- Optional: the window frame matches expected dimensions
Skip the verification and you end up with screenshots that look correct but capture subtly wrong states - a spinner that has only just disappeared, a dropdown that had not closed yet.
```swift
func verifyReadyToCapture(window: AXUIElement, expectedElements: [String]) -> Bool {
    for elementLabel in expectedElements {
        guard findElement(in: window, role: nil, label: elementLabel) != nil else {
            print("Verification failed: element '\(elementLabel)' not found")
            return false
        }
    }
    // Check no progress indicators are visible
    if findElement(in: window, role: kAXProgressIndicatorRole, label: nil) != nil {
        return false
    }
    return true
}
```
Scaling to Hundreds
Once the pattern is in place, scaling from 10 screenshots to 500 is an engineering problem, not a design problem:
- Run captures in parallel across multiple app instances when the app supports it
- Use a coordinator process that manages app lifecycle and retries failed captures
- Store the manifest in version control so your screenshot set evolves with the product
- Generate diff images to catch unintended visual regressions between runs
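For the regression check in the last bullet, even a naive comparison catches most unintended changes. A sketch (assuming both captures were decoded to the same size and pixel format):

```swift
import CoreGraphics

// Naive pixel-level comparison of two captures. A real diff tool would
// tolerate small antialiasing differences and emit a visual diff image.
func imagesDiffer(_ a: CGImage, _ b: CGImage) -> Bool {
    guard a.width == b.width, a.height == b.height else { return true }
    guard let dataA = a.dataProvider?.data as Data?,
          let dataB = b.dataProvider?.data as Data? else { return true }
    return dataA != dataB
}
```

Exact byte equality is strict; in practice a small per-pixel tolerance avoids flagging font-smoothing noise as a regression.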
The key shift is treating screenshot automation like a test suite - deterministic, reproducible, diffable, and integrated into CI.
Fazm is an open source macOS AI agent that uses AXUIElement extensively for desktop automation; the source is available on GitHub.