Memory Filters - Why AI Agents Need Aggressive Pruning
Most agent memory systems are hoarders. They save everything - every conversation, every fact, every preference the user mentioned once six months ago. The result is a bloated context that slows everything down and confuses the agent with outdated information.
The fix is aggressive filtering: only keep facts that were retrieved and actually used within a meaningful window. Research from Mem0 shows that fact-based memory with intelligent pruning requires roughly 1,800 tokens per conversation compared to 26,000 tokens when sending full context history - a 90% reduction in token consumption.
The Real Cost of Keeping Everything
Context windows are finite. Even with models that support 128K or 200K tokens, stuffing all of that capacity with memory records is wasteful and counterproductive. The agent spends tokens processing irrelevant information, and important recent context gets diluted.
Worse, stale memories actively mislead the agent. If a user changed their preferred email six months ago but the agent still has the old one in memory, it will use the wrong address. In the MemoryAgentBench benchmark, agents using pure FIFO (first-in, first-out) buffers that retain everything performed significantly worse on conflict resolution tasks than agents with pruning policies - the stale data corrupted their answers.
The practical cost shows up in three ways:
- Token spend: Sending 26K tokens instead of 1.8K per request multiplies your LLM API bill by 10-15x
- Latency: Larger context windows increase time-to-first-token; Zep's Temporal Knowledge Graph cut latency by 90% over naive retrieval baselines
- Answer quality: More memory is not better memory - a Mem0 study found a 26% accuracy boost when switching from full-context to structured fact-based memory with pruning
Three Pruning Algorithms - When to Use Each
1. LRU (Least Recently Used) Eviction
LRU is the simplest pruning strategy. When your active memory store reaches capacity, you evict whichever memory was accessed least recently.
```python
from collections import OrderedDict
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class MemoryEntry:
    key: str
    value: str
    created_at: datetime = field(default_factory=datetime.utcnow)
    last_retrieved: Optional[datetime] = None
    retrieval_count: int = 0

class LRUMemoryStore:
    def __init__(self, capacity: int = 50):
        self.capacity = capacity
        self._store: OrderedDict[str, MemoryEntry] = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._store:
            return None
        # Move to end (most recently used)
        self._store.move_to_end(key)
        entry = self._store[key]
        entry.last_retrieved = datetime.utcnow()
        entry.retrieval_count += 1
        return entry.value

    def put(self, key: str, value: str) -> None:
        if key in self._store:
            # Update in place so retrieval stats survive rewrites
            self._store[key].value = value
            self._store.move_to_end(key)
            return
        self._store[key] = MemoryEntry(key=key, value=value)
        if len(self._store) > self.capacity:
            # Evict least recently used (first item)
            evicted_key, _ = self._store.popitem(last=False)
            # Optionally archive evicted_key to cold storage here

    def active_count(self) -> int:
        return len(self._store)
```
LRU works well for conversational facts where recency is a strong signal. If a user told you their project name last week and you haven't needed it since, it's probably stale. The weakness: LRU ignores frequency. A memory accessed 50 times two weeks ago will be evicted before one accessed once yesterday.
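To make the eviction order concrete, here is a quick usage sketch of the store above (the keys and values are made up for illustration):

```python
# Tiny store to make the eviction order visible
store = LRUMemoryStore(capacity=2)
store.put("project_name", "atlas-migration")
store.put("preferred_email", "old@example.com")

store.get("project_name")            # touch it: now most recently used
store.put("timezone", "UTC+2")       # over capacity: evicts preferred_email

print(store.get("preferred_email"))  # None - evicted despite being newer
print(store.active_count())          # 2
```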
2. Frequency-Based Scoring (LFU)
LFU (Least Frequently Used) keeps the memories your agent actually relies on, regardless of when they were last accessed. This suits long-lived facts like user preferences, role context, or recurring workflows.
```python
import heapq

class LFUMemoryStore:
    def __init__(self, capacity: int = 50):
        self.capacity = capacity
        self._store: dict[str, MemoryEntry] = {}

    def get(self, key: str) -> Optional[str]:
        if key not in self._store:
            return None
        entry = self._store[key]
        entry.retrieval_count += 1
        entry.last_retrieved = datetime.utcnow()
        return entry.value

    def put(self, key: str, value: str) -> None:
        if key in self._store:
            # Existing key: update the value, keep the frequency count
            self._store[key].value = value
            return
        if len(self._store) >= self.capacity:
            # Build a (count, key) heap and evict the minimum-frequency entry
            freq_heap = [(e.retrieval_count, k) for k, e in self._store.items()]
            heapq.heapify(freq_heap)
            _, lfu_key = heapq.heappop(freq_heap)
            del self._store[lfu_key]
        self._store[key] = MemoryEntry(key=key, value=value)
```
LFU keeps your most-used memories even if they haven't been accessed in a while. The downside: very old high-frequency memories never get evicted even when they're no longer relevant.
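The contrast with LRU shows up in a small sketch (again, hypothetical keys):

```python
store = LFUMemoryStore(capacity=2)
store.put("user_role", "staff engineer")
store.put("current_task", "fix flaky CI job")

for _ in range(5):
    store.get("user_role")            # heavily used fact

store.put("meeting_time", "3pm")      # at capacity: evicts current_task (count 0)

print(store.get("user_role"))         # still present despite being the oldest
print(store.get("current_task"))      # None - least frequently used, evicted
```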
3. Relevance-Decay Scoring (Recommended)
The most effective approach combines recency, frequency, and importance into a single score - then prunes anything below a threshold. This mirrors how the MemOS and Agentic Memory frameworks approach the problem.
```python
import math
from datetime import datetime

class RelevanceDecayStore:
    def __init__(
        self,
        capacity: int = 50,
        decay_days: float = 30.0,
        prune_threshold: float = 0.1,
    ):
        self.capacity = capacity
        self.decay_days = decay_days
        self.prune_threshold = prune_threshold
        self._store: dict[str, MemoryEntry] = {}

    def _score(self, entry: MemoryEntry) -> float:
        """
        Score = frequency_weight * recency_decay * importance_boost

        - frequency_weight: log(1 + retrieval_count) normalizes burst access
        - recency_decay: exponential decay based on days since last retrieval
        - importance_boost: optional multiplier for pinned/critical facts
          (not implemented here)
        """
        now = datetime.utcnow()
        last_access = entry.last_retrieved or entry.created_at
        days_since_access = (now - last_access).total_seconds() / 86400
        # Count the original write as one access so brand-new facts
        # don't score zero and get pruned before they're ever used
        frequency_weight = math.log1p(max(entry.retrieval_count, 1))
        recency_decay = math.exp(-days_since_access / self.decay_days)
        return frequency_weight * recency_decay

    def prune(self) -> list[str]:
        """Remove memories below threshold. Returns list of pruned keys."""
        to_prune = [
            key for key, entry in self._store.items()
            if self._score(entry) < self.prune_threshold
        ]
        for key in to_prune:
            del self._store[key]
        return to_prune

    def put(self, key: str, value: str) -> None:
        self._store[key] = MemoryEntry(key=key, value=value)
        if len(self._store) > self.capacity:
            # Evict the lowest-scoring entry
            lowest = min(self._store, key=lambda k: self._score(self._store[k]))
            del self._store[lowest]

    def get(self, key: str) -> Optional[str]:
        if key not in self._store:
            return None
        entry = self._store[key]
        entry.last_retrieved = datetime.utcnow()
        entry.retrieval_count += 1
        return entry.value
```
With decay_days=30 and prune_threshold=0.1, a memory accessed once and untouched for 40 days gets a score of log(2) * exp(-40/30) ≈ 0.693 * 0.26 ≈ 0.18 - above threshold. After 60 days with zero additional access: 0.693 * exp(-60/30) ≈ 0.094 - pruned. Adjust decay_days up for long-lived agents, down for high-churn task agents.
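One way to pick these knobs is to run the score formula over a few representative access patterns before wiring it into the store. The helper below just restates _score as a standalone function for quick checks:

```python
import math

def score(retrieval_count: int, days_since_access: float, decay_days: float = 30.0) -> float:
    # Same formula as RelevanceDecayStore._score, pulled out for threshold tuning
    return math.log1p(max(retrieval_count, 1)) * math.exp(-days_since_access / decay_days)

print(round(score(1, 40), 3))   # ~0.183 - survives a 0.1 threshold
print(round(score(1, 60), 3))   # ~0.094 - pruned
print(round(score(10, 60), 3))  # ~0.325 - heavy past use keeps old facts alive
```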
How to Implement a Practical Memory Pipeline
A complete agent memory pipeline needs three tiers, not just one hot store:
┌─────────────────────────────────────────────────────────┐
│ ACTIVE (hot) │ max 50 entries │ relevance-decay │
│ Injected into every prompt │
├─────────────────────────────────────────────────────────┤
│ ARCHIVE (warm) │ unlimited │ semantic search │
│ Retrieved on demand via similarity query │
├─────────────────────────────────────────────────────────┤
│ COLD STORAGE │ unlimited │ exact key lookup │
│ Rarely accessed; compliance, audit trail │
└─────────────────────────────────────────────────────────┘
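Composed in code, the lookup path might look roughly like this. This is a sketch only: the archive and cold stores, and their search and get methods, are assumed placeholders for whatever vector store and key-value store you use.

```python
from typing import Optional

class MemoryPipeline:
    def __init__(self, active: RelevanceDecayStore, archive, cold):
        self.active = active
        self.archive = archive
        self.cold = cold

    def recall(self, key: str, query: str) -> Optional[str]:
        # 1. Hot tier: already injected into every prompt, exact key lookup
        value = self.active.get(key)
        if value is not None:
            return value
        # 2. Warm tier: similarity search on demand (placeholder API)
        hits = self.archive.search(query, top_k=1)
        if hits:
            self.active.put(key, hits[0])  # promote back into the hot tier
            return hits[0]
        # 3. Cold tier: exact key lookup for audit/compliance records
        return self.cold.get(key)
```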
Step 1 - Classify on write. When the agent learns a new fact, assign it a tier immediately. User preferences, active project names, and recurring constraints go to active. Completed task details and old conversation summaries go to archive.
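A minimal sketch of that routing, assuming the writing agent (or a cheap classifier call) tags each fact with a category label. The label names here are illustrative, not part of any framework:

```python
from enum import Enum

class Tier(Enum):
    ACTIVE = "active"
    ARCHIVE = "archive"
    COLD = "cold"

def classify_on_write(category: str) -> Tier:
    # Route durable, recurring context to the hot tier; finished work to warm
    if category in {"user_preference", "active_project", "recurring_constraint"}:
        return Tier.ACTIVE
    if category in {"completed_task", "conversation_summary"}:
        return Tier.ARCHIVE
    return Tier.COLD  # audit trail, compliance records, everything else
```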
Step 2 - Track actual use, not just retrieval. Retrieval means the memory was fetched into context. Use means the agent cited or acted on it. Only increment retrieval_count when the agent's output shows evidence that the memory was used - otherwise every semantic-search fetch inflates the frequency counts of unrelated noise.
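One cheap way to approximate this, assuming the RelevanceDecayStore above: fetch candidates without bumping counters, then credit only the facts the agent's output actually echoes. The substring check is a crude stand-in for citation markers or a judge-model pass:

```python
from datetime import datetime

def record_actual_use(store: RelevanceDecayStore, candidate_keys: list[str], agent_output: str) -> None:
    # Reads the underlying dict directly (for brevity) so fetching candidates
    # doesn't increment counts; only evidenced use does
    for key in candidate_keys:
        entry = store._store.get(key)
        if entry and entry.value.lower() in agent_output.lower():
            entry.retrieval_count += 1
            entry.last_retrieved = datetime.utcnow()
```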
Step 3 - Run a scheduled pruning job. Do not prune on every write - it's expensive. Instead, run a daily job that calls store.prune() and moves evicted entries to the archive tier:
```python
import schedule

def daily_prune(active_store: RelevanceDecayStore, archive_store) -> None:
    # Snapshot values first - prune() deletes the entries
    snapshot = {k: e.value for k, e in active_store._store.items()}
    pruned_keys = active_store.prune()
    for key in pruned_keys:
        # Don't delete - demote to warm archive
        archive_store.put(key, snapshot[key])
    print(f"Pruned {len(pruned_keys)} entries from active memory")

# schedule only fires when schedule.run_pending() is called in your main loop
schedule.every().day.at("02:00").do(daily_prune, active_store, archive_store)
```
Step 4 - Cap active memory at 50 entries. This is the key constraint that forces discipline. When the active store is full, any new write triggers an eviction. The agent is forced to decide whether new information is worth keeping over existing memories. In practice, 20-50 facts covers the relevant context for nearly all personal agent use cases.
What to Prune vs. What to Archive
Not all low-scoring memories should be deleted. Use this decision matrix:
| Memory type | Low score action |
|---|---|
| User preferences (email, name, role) | Archive - never delete |
| Task context from completed project | Archive after 30 days |
| Temporary state (current file open, last command) | Delete immediately after session |
| Frequently wrong facts (corrected by user) | Delete + log correction |
| Conversation summaries | Archive with compression |
The counterintuitive insight: deleting is rarely the right call. Archiving with semantic search handles edge cases where old context becomes suddenly relevant. What you want to eliminate from the active set is noise - low-score memories that pollute every prompt but rarely contribute.
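A sketch of that matrix as a dispatcher. The memory_type labels and the archive_store interface are placeholders, not a fixed schema:

```python
def handle_low_score(entry: MemoryEntry, memory_type: str, archive_store) -> None:
    if memory_type in ("user_preference", "completed_task"):
        archive_store.put(entry.key, entry.value)       # archive - never hard-delete
    elif memory_type == "temporary_state":
        return                                          # session-scoped; just drop it
    elif memory_type == "corrected_fact":
        print(f"dropping corrected fact: {entry.key}")  # delete + log the correction
    else:
        # Conversation summaries and everything else: archive (optionally
        # compress/summarize before writing to keep the warm tier small)
        archive_store.put(entry.key, entry.value)
```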
The Numbers
Keeping active memory under 50 facts and pruning aggressively produces measurable gains:
- Mem0's production data: 90% fewer tokens per conversation vs. full-context injection
- Zep Temporal Knowledge Graph: 18.5% accuracy improvement on long-horizon tasks, 90% latency reduction vs. naive retrieval
- Internal tests (Fazm): Agent response coherence improved noticeably when active memory dropped from ~200 entries to under 50 - fewer contradictions, tighter answers
The mechanism is straightforward. A smaller, higher-quality active memory set means less attention dilution. The model spends its "reasoning budget" on the right 20 facts instead of filtering through 200.
Fazm is an open-source macOS AI agent, available on GitHub.