Memory Triage for AI Agents - Why 100% Retention Is a Bug

Matthew Diakonov

An AI agent that remembers everything is not smart - it is cluttered. When every fact gets the same priority, the agent spends tokens retrieving outdated preferences, stale project contexts, and one-time corrections that no longer apply.

The fix is not better storage. It is intentional forgetting.

The 100% Retention Problem

Most agent memory systems are append-only. Every interaction adds new facts, corrections, and preferences. Nothing gets removed. Over weeks of use, the memory grows into a sprawling collection where critical instructions sit next to trivial observations.

When the agent loads context at session start, it has to process all of it. A memory file that was 2KB in week one is 50KB by month three. The signal-to-noise ratio drops continuously.

Research from the FiFA benchmark (300 simulation runs across five memory budget levels) shows that naive retention - keeping everything - actually degrades task-completion scores compared to structured forgetting policies. The best-performing hybrid policy reached a composite score of about 0.911 while keeping costs tractable.

Six Forgetting Policies, Ranked

The research literature has formalized six forgetting policies with measurable trade-offs:

  1. FIFO (First In, First Out) - evict oldest memories first. Simple but blind to importance.
  2. LRU (Least Recently Used) - evict memories not accessed recently. Optimal when usefulness decays exponentially with time.
  3. Priority Decay - weight memories by importance score, decay that score over time. Better for heterogeneous memory types.
  4. Reflection-Summary - compress old memories into summaries rather than deleting them. Preserves signal, reduces tokens.
  5. Random Drop - probabilistic eviction. Surprisingly useful as a baseline.
  6. Hybrid - combine the above mechanisms in stages. Best composite performance, but requires tuning.

For most desktop agents handling daily workflows, LRU combined with priority decay is the right starting point. Pure LRU is optimal when usefulness decays exponentially - which matches most personal assistant tasks, where last week's meeting notes matter less than yesterday's.
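
For intuition, here is what pure LRU looks like in isolation - a minimal sketch built on Python's OrderedDict (the LRUMemory class and its capacity parameter are illustrative, not from the research):

from collections import OrderedDict

class LRUMemory:
    def __init__(self, capacity=50):
        self.capacity = capacity
        self.items = OrderedDict()  # key -> memory dict, ordered by last access

    def access(self, key):
        # Touching a memory marks it most recently used.
        self.items.move_to_end(key)
        return self.items[key]

    def add(self, key, memory):
        self.items[key] = memory
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            # Evict the least recently used memory first.
            self.items.popitem(last=False)

Priority decay then layers an importance score on top of this pure recency ordering, which is exactly what the scoring functions below add.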

How Memory Decay Works in Practice

A concrete decay implementation from the generative agents literature uses recency scoring that decreases hourly:

def decay_score(memory, current_time, decay_factor=0.995):
    # decay_factor is the hourly retention multiplier: at 0.995, a memory's
    # score halves roughly every 138 hours (about six days).
    hours_elapsed = (current_time - memory.last_accessed).total_seconds() / 3600
    return memory.base_importance * (decay_factor ** hours_elapsed)

For adaptive decay that slows down for frequently recalled memories:

import math

def adaptive_decay_rate(base_rate, recall_count, beta=0.1, gamma=0.5):
    # base_rate is the hourly decay rate (e.g., 0.005, the complement of 0.995).
    # Memories recalled often decay slower; tanh caps the reduction at beta.
    return base_rate * (1 - beta * math.tanh(gamma * recall_count))

The intuition: a preference you reference daily should decay slower than a one-time project context. The tanh function bounds the adjustment, so the decay rate of a heavily recalled memory approaches a floor of base_rate * (1 - beta) - it never drops to zero, and no memory stops decaying entirely.
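
Plugging in numbers makes the bound concrete (assuming a base hourly decay rate of 0.005, the complement of the 0.995 retention factor above):

for count in (0, 3, 10):
    print(count, round(adaptive_decay_rate(0.005, count), 5))
# 0  -> 0.005    (never recalled: full base rate)
# 3  -> 0.00455  (tanh(1.5) is about 0.91)
# 10 -> 0.0045   (tanh(5) is about 1.0: the floor of 0.9 * base_rate)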

The Mem0 Benchmark: 91% Faster Retrieval

Mem0's production data provides a useful baseline. Their managed memory system achieves a 91% reduction in response time compared to loading full conversation context, while maintaining high accuracy on recall tasks. The key is that structured memory - extracting facts into discrete nodes - retrieves in sub-200ms even at scale.

That 91% improvement does not come merely from loading less. It comes from knowing what to load.

Implementing Memory Triage: A Practical Approach

Here is a working triage system that fits in a single Python module:

import json
import math
from datetime import datetime
from pathlib import Path

class MemoryTriage:
    def __init__(self, active_limit=50, archive_path="memory_archive.json"):
        self.active_limit = active_limit
        self.archive_path = Path(archive_path)
        self.archive = self._load_archive()

    def score_memory(self, memory: dict, now: datetime) -> float:
        """Score a memory by recency, access frequency, and base importance."""
        hours_old = (now - datetime.fromisoformat(memory["last_accessed"])).total_seconds() / 3600
        recall_count = memory.get("recall_count", 0)
        base_importance = memory.get("importance", 0.5)

        # Adaptive decay: recalled memories decay slower. The hourly loss
        # shrinks toward a floor of 0.9 * the base rate as recall_count grows.
        hourly_loss = 0.005 * (1 - 0.1 * math.tanh(0.5 * recall_count))
        recency_score = (1 - hourly_loss) ** hours_old

        return base_importance * recency_score * (1 + 0.1 * recall_count)

    def triage(self, memories: list[dict]) -> tuple[list[dict], list[dict]]:
        """Return (active, archived) after scoring all memories."""
        now = datetime.now()
        scored = [(self.score_memory(m, now), m) for m in memories]
        scored.sort(key=lambda x: x[0], reverse=True)

        active = [m for _, m in scored[:self.active_limit]]
        archived = [m for _, m in scored[self.active_limit:]]
        self.archive.extend(archived)
        self._save_archive()
        return active, archived

    def _load_archive(self) -> list[dict]:
        if self.archive_path.exists():
            return json.loads(self.archive_path.read_text())
        return []

    def _save_archive(self) -> None:
        self.archive_path.write_text(json.dumps(self.archive, indent=2))

The active_limit is the key parameter. Setting it to 50 forces the system to prioritize. Archived memories remain searchable on demand - they just do not load by default.
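
Continuing from the module above, a session-start call might look like this (the memories.json path is illustrative):

triage = MemoryTriage(active_limit=50)
memories = json.loads(Path("memories.json").read_text())

active, archived = triage.triage(memories)
print(f"Loading {len(active)} memories into context; {len(archived)} archived")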

What a Triage Pass Looks Like

Running a weekly triage on a real agent's memory file:

Before triage: 847 memories, 62KB
After triage:  41 active memories, 3.1KB loaded at session start
               806 archived memories available on demand

Decay distribution:
  Score > 0.8  (high priority):  41 memories  (4.8%)
  Score 0.4-0.8 (medium):       203 memories  (24%)
  Score < 0.4  (low/archived):  603 memories  (71%)

This matches the research finding that roughly 57% of memories older than 30 days score below the retention threshold. You are not losing information - you are deferring it.

Natural Decay Is Not Data Loss

Letting memories decay does not mean deleting them permanently. It means moving them out of the active context window. A decayed memory still exists in storage - it just is not loaded by default. If it becomes relevant again, it can be retrieved on demand.
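
On-demand retrieval can start as a naive keyword scan over the archive - search_archive below is a hypothetical helper, not part of the module above, and a production system would swap in embedding search:

def search_archive(archive: list[dict], query: str, limit: int = 5) -> list[dict]:
    # Rank archived memories by how many query terms appear in their text.
    terms = query.lower().split()

    def hits(memory: dict) -> int:
        text = memory.get("text", "").lower()
        return sum(term in text for term in terms)

    matches = [m for m in archive if hits(m) > 0]
    matches.sort(key=hits, reverse=True)
    return matches[:limit]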

The MAGMA architecture (Multi-Graph based Agentic Memory) demonstrates this at scale: it achieves competitive accuracy (83.9% on benchmarks) while using only 0.7K to 4.2K tokens per query - a reduction of more than 95% compared to loading full conversation history. The compression comes from structured graph retrieval, not from losing information.

The Right Amount of Forgetting

Human memory decays for a reason - it keeps the most relevant information accessible. AI agent memory should work the same way. Perfect recall is not the goal. Useful recall is.

The practical test: if your agent is loading the same stale preference about a project you wrapped up two months ago every single session, that is a bug. Triage fixes it.

Fazm is an open source macOS AI agent, available on GitHub.
