Half a Million Computer Actions in Seven Days: What the Data Revealed
The Terminator desktop automation library logged half a million computer actions in its first week of heavy usage. Clicks, keystrokes, scrolls, window manipulations, text selections - each one recorded with outcome, latency, and context. The aggregate data revealed patterns we did not expect, and a few we should have anticipated but did not.
This post breaks down what those 500,000 actions actually looked like - the distribution, the failure modes, and what changes when you move from tens of actions per session to hundreds of thousands per week.
The Action Distribution
The first surprise was how lopsided the distribution was.
About 78% of all actions were single-click events or keystrokes. These are the atomic building blocks - clicking a button, typing a character, pressing Enter. They are fast, they fail rarely, and they complete in under 50 milliseconds on a modern Mac.
About 14% were scroll operations and text selections - slightly more complex, occasionally interrupted by UI redraws.
The remaining 8% were the expensive operations: drag-and-drop sequences, modifier-key combos (Command+Shift+Option chains), coordinate-targeted interactions without a stable accessibility element to anchor to, and multi-step operations that had to be executed as a single atomic unit or not at all.
That 8% consumed roughly 40% of total debugging effort and accounted for over 70% of all failures.
Failure Rate by Action Type
Breaking down failure rates by action category:
| Action Type | Failure Rate | Notes |
|---|---|---|
| Simple click (element-targeted) | 0.08% | Element-not-found errors are rare when the UI is stable |
| Simple click (coordinate-targeted) | 1.2% | UI shifts break coordinate assumptions |
| Keystroke | 0.03% | Near-zero failures |
| Scroll | 0.4% | Scroll targets that disappear mid-scroll |
| Text selection | 0.9% | Font rendering edge cases on Retina |
| Drag-and-drop | 7.8% | Application-specific drag handling varies widely |
| Multi-step atomic sequence | 11.3% | A failure at any step fails the entire sequence |
The lesson from this table: element-targeted clicks are dramatically more reliable than coordinate-targeted ones. When the accessibility tree gives you a stable element reference, use it. Coordinates are a last resort.
Drag-and-drop's 7.8% failure rate is high enough that any automation relying heavily on drag operations needs explicit fallback strategies. In most cases, keyboard shortcuts accomplish the same result with a fraction of the failure rate.
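As an illustration, here is a minimal sketch of that fallback pattern. The `ui` helper object, its methods (`select`, `send_keys`, `focus`, `drag_and_drop`), and `ActionError` are hypothetical stand-ins for whatever primitives your automation layer exposes, not Terminator's actual API:

class ActionError(Exception):
    """Assumed to be raised by the hypothetical ui helpers when an action fails."""

def move_item(ui, item, destination):
    # Keyboard-driven cut/paste fails far less often than drag-and-drop,
    # so try it first and only fall back to dragging.
    try:
        ui.select(item)
        ui.send_keys("cmd+x")
        ui.focus(destination)
        ui.send_keys("cmd+v")
    except ActionError:
        ui.drag_and_drop(item, destination)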
The Verification Overhead Problem
At low volume, you can afford to verify every action. After each click, take a screenshot. Confirm the expected state change occurred. This is good practice and catches errors early.
At 500,000 actions per week, the math breaks badly.
If each verification screenshot takes 150ms to capture and process, and you run one after every action, you add 75,000 seconds of overhead per week - just over 20 hours. That is not sustainable in a continuous automation system; your agents end up spending more time verifying than acting.
The solution is sampled verification with risk-weighted sampling rates:
import random

VERIFICATION_RATES = {
    "simple_click_element": 0.02,     # 2% sampled
    "simple_click_coordinate": 0.15,  # 15% sampled
    "keystroke": 0.005,               # 0.5% sampled
    "drag_drop": 1.0,                 # 100% - always verify
    "multi_step_sequence": 1.0,       # 100% - always verify
}

def should_verify(action_type: str) -> bool:
    # Unknown action types fall back to a 10% sampling rate.
    rate = VERIFICATION_RATES.get(action_type, 0.1)
    return random.random() < rate
This drops verification overhead by roughly 85% while keeping full verification on the high-failure-rate action types that need it.
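In the execution loop, sampled verification looks roughly like this; `execute_action`, `capture_screenshot`, and `verify_expected_state` are hypothetical placeholders for your own executor and verifier:

def run_sequence(actions):
    for action in actions:
        execute_action(action)
        # Only pay the ~150ms screenshot cost for the sampled fraction of actions.
        if should_verify(action.type):
            screenshot = capture_screenshot()
            if not verify_expected_state(action, screenshot):
                raise RuntimeError(f"verification failed after {action.type}")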
Cascading Failures and the 3-Action Rule
One pattern that emerged from the failure data: failures cluster. When one action fails, the probability of the next action failing is 4x higher than baseline. A failed click leaves the UI in an unexpected state. The next action, planned for the expected post-click state, encounters something different.
This clustering means a single failure can cascade into 5-10 consecutive failures before the agent detects it has gone off the rails.
The practical response is what we call the 3-action rule: if three consecutive actions fail or produce unexpected results, halt the current sequence, take a full screenshot, re-analyze the UI state from scratch, and replan from the current position rather than assuming the original plan is still valid.
class ActionExecutor:
    def __init__(self):
        self.consecutive_failures = 0
        self.FAILURE_THRESHOLD = 3

    def execute(self, action):
        result = action.run()
        if result.failed:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.FAILURE_THRESHOLD:
                # Three strikes: stop, re-observe the UI, and replan
                # from the current state instead of pushing on.
                self.replan_from_current_state()
                self.consecutive_failures = 0
        else:
            self.consecutive_failures = 0
        return result
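The `replan_from_current_state` hook is where the rule's recovery steps live. A minimal sketch, assuming the executor also keeps a `pending_actions` queue and that `capture_screenshot`, `analyze_ui`, and `build_plan` are hypothetical helpers for the screenshot, re-analysis, and replanning steps described above:

def replan_from_current_state(self):
    # Re-observe: take a full screenshot and re-analyze the UI from scratch.
    screenshot = capture_screenshot()
    ui_state = analyze_ui(screenshot)
    # Drop the original plan and replan from the current position.
    self.pending_actions = build_plan(ui_state)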
Latency Distribution
The latency data had a heavy tail. Median action latency was 38ms. The 95th percentile was 180ms. The 99th percentile was 1.4 seconds.
That 99th percentile matters for sequences of 50+ actions. With 50 actions at median latency, you finish in 2 seconds. If even two of those actions hit the 99th percentile, your sequence now takes 5 seconds. At 100 actions, the heavy tail dominates wall-clock time more than median performance does.
The implication: do not benchmark your automation sequences at median latency. Benchmark at P99. Your users will experience P99 regularly even if it only appears 1% of the time per action.
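If you already log per-action latency, computing those percentiles takes only a few lines. A sketch using Python's statistics module, where latencies_ms is the list of recorded latencies in milliseconds:

import statistics

def latency_percentiles(latencies_ms):
    # quantiles(n=100) returns the 99 cut points P1 through P99.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}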
What This Means for Building Desktop Agents
The 500K action dataset changes how we think about the action layer for desktop agents.
First, reliability compounds. A sequence of 10 actions at 98% individual reliability has only an 82% chance of completing without a single failure. At 50 actions, that drops to 36%. Building reliable automation at scale requires either keeping sequences short, building aggressive retry/replan logic, or accepting that failures are normal and building recovery into the design.
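The compounding math is simple enough to keep on hand when sizing sequences; the 82% and 36% figures above are just the per-action reliability raised to the sequence length:

def sequence_success_probability(per_action_reliability, n_actions):
    # 0.98 ** 10 ≈ 0.82, 0.98 ** 50 ≈ 0.36
    return per_action_reliability ** n_actions

def required_per_action_reliability(target, n_actions):
    # Inverting it: finishing a 50-action sequence 95% of the time
    # needs roughly 0.95 ** (1/50) ≈ 0.999 per-action reliability.
    return target ** (1 / n_actions)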
Second, the quality of the action layer matters more than the quality of the planning model. A brilliant planning model paired with a poorly optimized action layer (heavy coordinate targeting, no fallback for drag failures) will underperform a simpler model that uses element-targeted clicks and keyboard shortcuts wherever possible.
Third, logging is not optional. Without action-level logging including success/failure, latency, and the UI state at time of execution, you are flying blind. The insights above came from structured logs. Teams that skip logging cannot improve their failure rates because they do not know what is failing or why.
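A minimal, illustrative schema for such a log - one JSON line per action, and not Terminator's actual format:

import json
from dataclasses import dataclass, asdict

@dataclass
class ActionLogRecord:
    action_type: str   # e.g. "simple_click_element" or "drag_drop"
    succeeded: bool
    latency_ms: float
    ui_state: dict     # whatever UI context was captured at execution time
    timestamp: float

def log_action(record: ActionLogRecord, path: str = "actions.jsonl") -> None:
    # Append one JSON line per action so failure rates and latency
    # percentiles can be aggregated offline.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")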
Reliability at scale is the unglamorous foundation that makes intelligent desktop automation actually useful. Speed and planning quality are the headline metrics, but your failure rate at action 47 of a 50-action sequence is what determines whether the tool earns daily use.
Fazm is an open source macOS AI agent, available on GitHub.