Suppressed 34 Errors in 14 Days - When to Escalate Regardless of Severity

Fazm Team

We had an error monitoring system that classified errors by severity. Low-severity errors got logged and ignored. Medium ones generated alerts during business hours. High-severity ones paged someone immediately.

Over 14 days, 34 low-severity errors were suppressed. Same root cause every time. Same code path. Same failure mode. Nobody looked at them because the system classified each one as low-severity in isolation.
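The blind spot can be sketched in a few lines. This is a minimal illustration, not our actual monitoring code: the `Severity` enum and `route` function are hypothetical names, and the business-hours gating for medium alerts is left out for brevity.

```python
from enum import Enum


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def route(severity: Severity) -> str:
    """Decide what happens to one error, looking only at its own severity."""
    if severity is Severity.HIGH:
        return "page"
    if severity is Severity.MEDIUM:
        return "alert"  # business-hours gating omitted in this sketch
    return "log"


# 34 occurrences of the same low-severity error, each routed in isolation
actions = [route(Severity.LOW) for _ in range(34)]
escalated = [a for a in actions if a != "log"]
```

Because `route` has no memory of earlier events, `escalated` stays empty no matter how many times the same error repeats.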

Then the underlying issue cascaded into a high-severity failure. All 34 suppressed errors were symptoms we should have investigated on day two.

The Recurrence Rule

Severity-based escalation misses a critical signal: recurrence. A single low-severity error can usually be ignored. The same low-severity error happening repeatedly is a pattern, and patterns indicate systemic problems.

The fix is simple: if the same error happens three times with the same root cause, escalate it regardless of severity. Three occurrences means it is not a fluke - it is a trend.

How to Implement It

Track errors by root cause signature, not just by message. Group them by:

  • Stack trace similarity - Same call stack means same code path
  • Input patterns - Same type of input triggering the failure
  • Timing patterns - Same time of day or same sequence of preceding events
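One possible way to build a signature from stack trace similarity is to hash the error type together with the top few frames after stripping volatile details. The function name `root_cause_signature`, the frame string format, and the choice of five frames are assumptions made for this sketch. Input and timing patterns could be folded in as extra fields of the hashed payload.

```python
import hashlib
import re


def root_cause_signature(error_type: str, frames: list[str], top_n: int = 5) -> str:
    """Hash the error type plus the top stack frames, ignoring volatile details."""
    normalized = []
    for frame in frames[:top_n]:
        # Memory addresses differ between runs, so mask them
        frame = re.sub(r"0x[0-9a-fA-F]+", "<addr>", frame)
        # Line numbers shift with unrelated edits, so mask them too
        frame = re.sub(r":\d+", ":<line>", frame)
        normalized.append(frame.strip())
    payload = error_type + "|" + "|".join(normalized)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Two errors from the same code path, with slightly different line numbers
a = root_cause_signature("TimeoutError", ["ui/click.py:42 in press", "agent/loop.py:118 in step"])
b = root_cause_signature("TimeoutError", ["ui/click.py:57 in press", "agent/loop.py:120 in step"])
# A different error type on the same code path
c = root_cause_signature("PermissionError", ["ui/click.py:42 in press", "agent/loop.py:118 in step"])
```

Here `a` and `b` land in the same group even though their messages and line numbers differ, while `c` gets its own group.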

When any group hits three occurrences within a rolling window, promote it to the next severity level automatically. If it hits five, promote it again.
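The promotion rule above might look like the following sketch. The class name `RecurrenceEscalator`, the three-level severity list, and the 24-hour default window are assumptions; pick a window that fits your own error volume.

```python
from collections import defaultdict, deque

SEVERITIES = ["low", "medium", "high"]


class RecurrenceEscalator:
    def __init__(self, window_seconds: float = 24 * 3600):
        self.window_seconds = window_seconds
        # signature -> timestamps of occurrences still inside the window
        self._seen = defaultdict(deque)

    def record(self, signature: str, base_severity: str, now: float) -> str:
        """Record one occurrence and return its effective severity."""
        times = self._seen[signature]
        times.append(now)
        # Drop occurrences that have aged out of the rolling window
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        count = len(times)
        # Three occurrences promote one level, five promote two
        promotions = 2 if count >= 5 else 1 if count >= 3 else 0
        level = min(SEVERITIES.index(base_severity) + promotions, len(SEVERITIES) - 1)
        return SEVERITIES[level]


esc = RecurrenceEscalator(window_seconds=3600)
# Five occurrences of the same signature, one minute apart
results = [esc.record("sig-1", "low", t * 60) for t in range(5)]
# A much later occurrence, after the earlier ones have aged out
later = esc.record("sig-1", "low", 10 * 3600)
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the rule easy to test. Severity is capped at the top level, so an error that is already high stays high.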

Why AI Agents Need This More

AI agent systems tend to generate errors at a higher rate than traditional software because they operate in non-deterministic environments. A desktop agent clicking UI elements will encounter transient failures constantly, and most are genuinely ignorable.

But when the same UI element fails to respond three times in a row, that is not transience - it is a changed interface, a permission issue, or a timing bug. The agent's error handling needs to distinguish between "this failed once" and "this keeps failing."
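For an agent, the distinction between "failed once" and "keeps failing" can be as simple as a per-element streak counter. This is a sketch under assumptions: `ElementFailureTracker`, the element id strings, and the returned action names are all hypothetical, not part of Fazm's API.

```python
from collections import defaultdict


class ElementFailureTracker:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        # element id -> number of consecutive failures
        self._streaks = defaultdict(int)

    def record(self, element_id: str, succeeded: bool) -> str:
        """Return the suggested action after one interaction attempt."""
        if succeeded:
            # Any success means earlier failures were transient
            self._streaks[element_id] = 0
            return "ok"
        self._streaks[element_id] += 1
        if self._streaks[element_id] >= self.threshold:
            return "escalate"
        return "retry"


tracker = ElementFailureTracker()
outcomes = [tracker.record("save-button", False) for _ in range(3)]
after_success = tracker.record("save-button", True)
after_reset = tracker.record("save-button", False)
```

A success resets the streak, so scattered transient failures never escalate, while three failures in a row on the same element do.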

Recurrence Trumps Severity

Build your monitoring around patterns, not individual events. A recurring low-severity error can be more dangerous than a one-time high-severity crash: the crash gets attention immediately, while the recurring error silently accumulates damage.

Fazm is an open source macOS AI agent, available on GitHub.

