Suppressed 34 Errors in 14 Days - When to Escalate Regardless of Severity
We had an error monitoring system that classified errors by severity. Low-severity errors got logged and ignored. Medium ones generated alerts during business hours. High-severity ones paged someone immediately.
Over 14 days, 34 low-severity errors were suppressed. Same root cause every time. Same code path. Same failure mode. Nobody looked at them because the system classified each one as low-severity in isolation.
Then the underlying issue cascaded into a high-severity failure. All 34 suppressed errors were symptoms we should have investigated on day two.
The Recurrence Rule
Severity-based escalation misses a critical signal: recurrence. A single low-severity error is ignorable. The same low-severity error happening repeatedly is a pattern, and patterns indicate systematic problems.
The fix is simple: if the same error occurs three times with the same root cause inside a bounded time window, escalate it regardless of its individual severity. Three occurrences are not a fluke - they are a trend.
How to Implement It
Track errors by root cause signature, not just by message. Group them by:
- Stack trace similarity - Same call stack means same code path
- Input patterns - Same type of input triggering the failure
- Timing patterns - Same time of day or same sequence of preceding events
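As a rough illustration of the first grouping criterion, here is a minimal sketch of a stack-trace signature in Python. The function name `root_cause_signature` and the `depth` parameter are hypothetical, not part of any monitoring library. The sketch assumes that exception type plus file and function names are enough to identify a code path. It deliberately ignores the message and line numbers, and it does not cover input or timing patterns.

```python
import hashlib
import os
import traceback


def root_cause_signature(exc: BaseException, depth: int = 5) -> str:
    """Fingerprint an exception by type and call path, ignoring its message.

    Hypothetical helper: `depth` (how many innermost frames to keep) is an
    assumed tuning knob, not something the original system defined.
    """
    frames = traceback.extract_tb(exc.__traceback__)
    # Keep only file and function names. Dropping line numbers means a small
    # edit to a file does not split one root cause into several groups.
    path = [f"{os.path.basename(f.filename)}:{f.name}" for f in frames[-depth:]]
    raw = "|".join([type(exc).__name__, *path])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:12]
```

Two errors raised from the same call stack get the same signature even when their messages differ, which is what lets them be counted as one recurring problem.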
When any group hits three occurrences within a rolling window, promote it to the next severity level automatically. If it hits five, promote it again.
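The promotion rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the system we ran: the class name `RecurrenceEscalator`, the one-hour window, and the three-level severity list are all hypothetical. Only the thresholds of three and five come from the rule itself.

```python
from collections import defaultdict, deque

# Assumed severity ladder, lowest to highest.
SEVERITIES = ["low", "medium", "high"]


class RecurrenceEscalator:
    """Promotes an error group's severity as it recurs within a rolling window."""

    def __init__(self, window_seconds=3600, thresholds=(3, 5)):
        self.window_seconds = window_seconds  # assumed window length
        self.thresholds = thresholds          # occurrences that trigger a promotion
        self._seen = defaultdict(deque)       # signature -> recent timestamps

    def record(self, signature, base_severity, now):
        """Record one occurrence and return the effective severity."""
        times = self._seen[signature]
        times.append(now)
        # Evict occurrences that have aged out of the rolling window.
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        # One promotion step for each threshold the group has reached.
        steps = sum(1 for t in self.thresholds if len(times) >= t)
        idx = min(SEVERITIES.index(base_severity) + steps, len(SEVERITIES) - 1)
        return SEVERITIES[idx]


esc = RecurrenceEscalator(window_seconds=3600)
levels = [esc.record("sig-a", "low", now=t) for t in (0, 60, 120, 180, 240)]
# levels == ["low", "low", "medium", "medium", "high"]
```

Passing `now` explicitly keeps the sketch easy to test. In a real pipeline it would more likely come from the error event's own timestamp.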
Why AI Agents Need This More
AI agent systems tend to generate errors at a higher rate than traditional software because they operate in non-deterministic environments. A desktop agent clicking UI elements will encounter transient failures constantly, and most of them are genuinely ignorable.
But when the same UI element fails to respond three times in a row, that is not transience - it is a changed interface, a permission issue, or a timing bug. The agent's error handling needs to distinguish between "this failed once" and "this keeps failing."
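One way an agent could draw that line is to count consecutive failures per element and reset the count on any success. The sketch below is hypothetical: `ElementFailureTracker`, the `element_id` strings, and the "ok" / "retry" / "escalate" return values are assumptions made for illustration, not Fazm's actual error handling.

```python
from collections import defaultdict


class ElementFailureTracker:
    """Counts consecutive failures per UI element (hypothetical helper)."""

    def __init__(self, limit=3):
        self.limit = limit                # failures in a row before escalating
        self._streaks = defaultdict(int)  # element_id -> current failure streak

    def report(self, element_id, succeeded):
        """Return "ok", "retry", or "escalate" for the latest attempt."""
        if succeeded:
            self._streaks[element_id] = 0  # a success ends the streak
            return "ok"
        self._streaks[element_id] += 1
        if self._streaks[element_id] >= self.limit:
            return "escalate"  # "this keeps failing"
        return "retry"         # "this failed once"
```

A single failed click returns "retry", while the third failure in a row on the same element returns "escalate", so the agent can stop retrying and surface the problem.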
Recurrence Trumps Severity
Build your monitoring around patterns, not individual events. A recurring low-severity error can be more dangerous than a one-time high-severity crash, because the crash usually gets fixed immediately while the recurring error silently accumulates damage.
Fazm is an open-source macOS AI agent, available on GitHub.