Memory Filters - Why AI Agents Need Aggressive Pruning
Most agent memory systems are hoarders. They save everything - every conversation, every fact, every preference the user mentioned once six months ago. The result is a bloated context that slows everything down and confuses the agent with outdated information.
The fix is aggressive filtering: only keep facts that were retrieved and actually used within a meaningful window. Research from Mem0 shows that fact-based memory with intelligent pruning requires roughly 1,800 tokens per conversation compared to 26,000 tokens when sending full context history - a 90% reduction in token consumption.
The Real Cost of Keeping Everything
Context windows are finite. Even with models that support 128K or 200K tokens, stuffing all of that capacity with memory records is wasteful and counterproductive. The agent spends tokens processing irrelevant information, and important recent context gets diluted.
Worse, stale memories actively mislead the agent. If a user changed their preferred email six months ago but the agent still has the old one in memory, it will use the wrong address. In the MemoryAgentBench benchmark, agents using pure FIFO (first-in, first-out) buffers that retain everything performed significantly worse on conflict resolution tasks than agents with pruning policies - the stale data corrupted their answers.
The practical cost shows up in three ways:
- Token spend: Sending 26K tokens instead of 1.8K per request multiplies your LLM API bill by 10-15x
- Latency: Larger context windows increase time-to-first-token; Zep's Temporal Knowledge Graph cut latency by 90% over naive retrieval baselines
- Answer quality: More memory is not better memory - a Mem0 study found a 26% accuracy boost when switching from full-context to structured fact-based memory with pruning
Three Pruning Algorithms - When to Use Each
1. LRU (Least Recently Used) Eviction
LRU is the simplest pruning strategy. When your active memory store reaches capacity, you evict whichever memory was accessed least recently.
```python
from collections import OrderedDict
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class MemoryEntry:
    key: str
    value: str
    created_at: datetime = field(default_factory=datetime.utcnow)
    last_retrieved: Optional[datetime] = None
    retrieval_count: int = 0

class LRUMemoryStore:
    def __init__(self, capacity: int = 50):
        self.capacity = capacity
        self._store: OrderedDict[str, MemoryEntry] = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._store:
            return None
        # Move to end (most recently used)
        self._store.move_to_end(key)
        entry = self._store[key]
        entry.last_retrieved = datetime.utcnow()
        entry.retrieval_count += 1
        return entry.value

    def put(self, key: str, value: str) -> None:
        if key in self._store:
            # Update in place so retrieval stats survive rewrites
            self._store[key].value = value
            self._store.move_to_end(key)
            return
        self._store[key] = MemoryEntry(key=key, value=value)
        if len(self._store) > self.capacity:
            # Evict least recently used (first item)
            evicted_key, _ = self._store.popitem(last=False)
            # Optionally archive evicted_key to cold storage here

    def active_count(self) -> int:
        return len(self._store)
```
LRU works well for conversational facts where recency is a strong signal. If a user told you their project name last week and you haven't needed it since, it's probably stale. The weakness: LRU ignores frequency. A memory accessed 50 times two weeks ago will be evicted before one accessed once yesterday.
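To make the eviction order concrete, here is a quick usage sketch of the store above (the keys and values are made up for illustration):

```python
# Tiny store to make the eviction order visible
store = LRUMemoryStore(capacity=2)
store.put("project_name", "atlas-migration")
store.put("preferred_email", "old@example.com")

store.get("project_name")            # touch it: now most recently used
store.put("timezone", "UTC+2")       # over capacity: evicts preferred_email

print(store.get("preferred_email"))  # None - evicted despite being newer
print(store.active_count())          # 2
```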
2. Frequency-Based Scoring (LFU)
LFU (Least Frequently Used) keeps the memories your agent actually relies on, regardless of when they were last accessed. This suits long-lived facts like user preferences, role context, or recurring workflows.
```python
import heapq

class LFUMemoryStore:
    def __init__(self, capacity: int = 50):
        self.capacity = capacity
        self._store: dict[str, MemoryEntry] = {}

    def get(self, key: str) -> Optional[str]:
        if key not in self._store:
            return None
        entry = self._store[key]
        entry.retrieval_count += 1
        entry.last_retrieved = datetime.utcnow()
        return entry.value

    def put(self, key: str, value: str) -> None:
        if key in self._store:
            # Existing key: update the value, keep the frequency count
            self._store[key].value = value
            return
        if len(self._store) >= self.capacity:
            # Build a (count, key) heap and evict the minimum-frequency entry
            freq_heap = [(e.retrieval_count, k) for k, e in self._store.items()]
            heapq.heapify(freq_heap)
            _, lfu_key = heapq.heappop(freq_heap)
            del self._store[lfu_key]
        self._store[key] = MemoryEntry(key=key, value=value)
```
LFU keeps your most-used memories even if they haven't been accessed in a while. The downside: very old high-frequency memories never get evicted even when they're no longer relevant.
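The contrast with LRU shows up in a small sketch (again, hypothetical keys):

```python
store = LFUMemoryStore(capacity=2)
store.put("user_role", "staff engineer")
store.put("current_task", "fix flaky CI job")

for _ in range(5):
    store.get("user_role")            # heavily used fact

store.put("meeting_time", "3pm")      # at capacity: evicts current_task (count 0)

print(store.get("user_role"))         # still present despite being the oldest
print(store.get("current_task"))      # None - least frequently used, evicted
```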
3. Relevance-Decay Scoring (Recommended)
The most effective approach combines recency, frequency, and importance into a single score - then prunes anything below a threshold. This mirrors how the MemOS and Agentic Memory frameworks approach the problem.
```python
import math
from datetime import datetime

class RelevanceDecayStore:
    def __init__(
        self,
        capacity: int = 50,
        decay_days: float = 30.0,
        prune_threshold: float = 0.1,
    ):
        self.capacity = capacity
        self.decay_days = decay_days
        self.prune_threshold = prune_threshold
        self._store: dict[str, MemoryEntry] = {}

    def _score(self, entry: MemoryEntry) -> float:
        """
        Score = frequency_weight * recency_decay * importance_boost

        - frequency_weight: log(1 + retrieval_count) normalizes burst access
        - recency_decay: exponential decay based on days since last retrieval
        - importance_boost: optional multiplier for pinned/critical facts
          (not implemented here)
        """
        now = datetime.utcnow()
        last_access = entry.last_retrieved or entry.created_at
        days_since_access = (now - last_access).total_seconds() / 86400
        # Count the original write as one access so brand-new facts
        # don't score zero and get pruned before they're ever used
        frequency_weight = math.log1p(max(entry.retrieval_count, 1))
        recency_decay = math.exp(-days_since_access / self.decay_days)
        return frequency_weight * recency_decay

    def prune(self) -> list[str]:
        """Remove memories below threshold. Returns list of pruned keys."""
        to_prune = [
            key for key, entry in self._store.items()
            if self._score(entry) < self.prune_threshold
        ]
        for key in to_prune:
            del self._store[key]
        return to_prune

    def put(self, key: str, value: str) -> None:
        self._store[key] = MemoryEntry(key=key, value=value)
        if len(self._store) > self.capacity:
            # Evict the lowest-scoring entry
            lowest = min(self._store, key=lambda k: self._score(self._store[k]))
            del self._store[lowest]

    def get(self, key: str) -> Optional[str]:
        if key not in self._store:
            return None
        entry = self._store[key]
        entry.last_retrieved = datetime.utcnow()
        entry.retrieval_count += 1
        return entry.value
```
With decay_days=30 and prune_threshold=0.1, a memory accessed once and untouched for 40 days gets a score of log(2) * exp(-40/30) ≈ 0.693 * 0.26 ≈ 0.18 - above threshold. After 60 days with zero additional access: 0.693 * exp(-60/30) ≈ 0.094 - pruned. Adjust decay_days up for long-lived agents, down for high-churn task agents.
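One way to pick these knobs is to run the score formula over a few representative access patterns before wiring it into the store. The helper below just restates _score as a standalone function for quick checks:

```python
import math

def score(retrieval_count: int, days_since_access: float, decay_days: float = 30.0) -> float:
    # Same formula as RelevanceDecayStore._score, pulled out for threshold tuning
    return math.log1p(max(retrieval_count, 1)) * math.exp(-days_since_access / decay_days)

print(round(score(1, 40), 3))   # ~0.183 - survives a 0.1 threshold
print(round(score(1, 60), 3))   # ~0.094 - pruned
print(round(score(10, 60), 3))  # ~0.325 - heavy past use keeps old facts alive
```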
How to Implement a Practical Memory Pipeline
A complete agent memory pipeline needs three tiers, not just one hot store:
┌─────────────────────────────────────────────────────────┐
│ ACTIVE (hot) │ max 50 entries │ relevance-decay │
│ Injected into every prompt │
├─────────────────────────────────────────────────────────┤
│ ARCHIVE (warm) │ unlimited │ semantic search │
│ Retrieved on demand via similarity query │
├─────────────────────────────────────────────────────────┤
│ COLD STORAGE │ unlimited │ exact key lookup │
│ Rarely accessed; compliance, audit trail │
└─────────────────────────────────────────────────────────┘
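Composed in code, the lookup path might look roughly like this. This is a sketch only: the archive and cold stores, and their search and get methods, are assumed placeholders for whatever vector store and key-value store you use.

```python
from typing import Optional

class MemoryPipeline:
    def __init__(self, active: RelevanceDecayStore, archive, cold):
        self.active = active
        self.archive = archive
        self.cold = cold

    def recall(self, key: str, query: str) -> Optional[str]:
        # 1. Hot tier: already injected into every prompt, exact key lookup
        value = self.active.get(key)
        if value is not None:
            return value
        # 2. Warm tier: similarity search on demand (placeholder API)
        hits = self.archive.search(query, top_k=1)
        if hits:
            self.active.put(key, hits[0])  # promote back into the hot tier
            return hits[0]
        # 3. Cold tier: exact key lookup for audit/compliance records
        return self.cold.get(key)
```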
Step 1 - Classify on write. When the agent learns a new fact, assign it a tier immediately. User preferences, active project names, and recurring constraints go to active. Completed task details and old conversation summaries go to archive.
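A minimal sketch of that routing, assuming the writing agent (or a cheap classifier call) tags each fact with a category label. The label names here are illustrative, not part of any framework:

```python
from enum import Enum

class Tier(Enum):
    ACTIVE = "active"
    ARCHIVE = "archive"
    COLD = "cold"

def classify_on_write(category: str) -> Tier:
    # Route durable, recurring context to the hot tier; finished work to warm
    if category in {"user_preference", "active_project", "recurring_constraint"}:
        return Tier.ACTIVE
    if category in {"completed_task", "conversation_summary"}:
        return Tier.ARCHIVE
    return Tier.COLD  # audit trail, compliance records, everything else
```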
Step 2 - Track actual use, not just retrieval. Retrieval means the memory was fetched into context. Use means the agent cited or acted on it. Only increment retrieval_count when the agent's output shows evidence that the memory was used - otherwise every semantic-search fetch inflates the frequency counts of unrelated noise.
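One cheap way to approximate this, assuming the RelevanceDecayStore above: fetch candidates without bumping counters, then credit only the facts the agent's output actually echoes. The substring check is a crude stand-in for citation markers or a judge-model pass:

```python
from datetime import datetime

def record_actual_use(store: RelevanceDecayStore, candidate_keys: list[str], agent_output: str) -> None:
    # Reads the underlying dict directly (for brevity) so fetching candidates
    # doesn't increment counts; only evidenced use does
    for key in candidate_keys:
        entry = store._store.get(key)
        if entry and entry.value.lower() in agent_output.lower():
            entry.retrieval_count += 1
            entry.last_retrieved = datetime.utcnow()
```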
Step 3 - Run a scheduled pruning job. Do not prune on every write - it's expensive. Instead, run a daily job that calls store.prune() and moves evicted entries to the archive tier:
```python
import schedule

def daily_prune(active_store: RelevanceDecayStore, archive_store) -> None:
    # Snapshot values first - prune() deletes the entries
    snapshot = {k: e.value for k, e in active_store._store.items()}
    pruned_keys = active_store.prune()
    for key in pruned_keys:
        # Don't delete - demote to warm archive
        archive_store.put(key, snapshot[key])
    print(f"Pruned {len(pruned_keys)} entries from active memory")

# schedule only fires when schedule.run_pending() is called in your main loop
schedule.every().day.at("02:00").do(daily_prune, active_store, archive_store)
```
Step 4 - Cap active memory at 50 entries. This is the key constraint that forces discipline. When the active store is full, any new write triggers an eviction. The agent is forced to decide whether new information is worth keeping over existing memories. In practice, 20-50 facts covers the relevant context for nearly all personal agent use cases.
What to Prune vs. What to Archive
Not all low-scoring memories should be deleted. Use this decision matrix:
| Memory type | Low score action |
|---|---|
| User preferences (email, name, role) | Archive - never delete |
| Task context from completed project | Archive after 30 days |
| Temporary state (current file open, last command) | Delete immediately after session |
| Frequently wrong facts (corrected by user) | Delete + log correction |
| Conversation summaries | Archive with compression |
The counterintuitive insight: deleting is rarely the right call. Archiving with semantic search handles edge cases where old context becomes suddenly relevant. What you want to eliminate from the active set is noise - low-score memories that pollute every prompt but rarely contribute.
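A sketch of that matrix as a dispatcher. The memory_type labels and the archive_store interface are placeholders, not a fixed schema:

```python
def handle_low_score(entry: MemoryEntry, memory_type: str, archive_store) -> None:
    if memory_type in ("user_preference", "completed_task"):
        archive_store.put(entry.key, entry.value)       # archive - never hard-delete
    elif memory_type == "temporary_state":
        return                                          # session-scoped; just drop it
    elif memory_type == "corrected_fact":
        print(f"dropping corrected fact: {entry.key}")  # delete + log the correction
    else:
        # Conversation summaries and everything else: archive (optionally
        # compress/summarize before writing to keep the warm tier small)
        archive_store.put(entry.key, entry.value)
```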
The Numbers
Keeping active memory under 50 facts and pruning aggressively produces measurable gains:
- Mem0's production data: 90% fewer tokens per conversation vs. full-context injection
- Zep Temporal Knowledge Graph: 18.5% accuracy improvement on long-horizon tasks, 90% latency reduction vs. naive retrieval
- Internal tests (Fazm): Agent response coherence improved noticeably when active memory dropped from ~200 entries to under 50 - fewer contradictions, tighter answers
The mechanism is straightforward. A smaller, higher-quality active memory set means less attention dilution. The model spends its "reasoning budget" on the right 20 facts instead of filtering through 200.
Fazm is an open-source macOS AI agent, available on GitHub.