Building an LLM-Powered Data Janitor for Browser-Extracted Memories
An LLM-powered review skill for browser-extracted memories: classify keep/delete/merge/fix in batches, with self-ranking via hit_rate. This is how you turn a messy pile of browser data into a useful knowledge base.
The Problem With Raw Browser Data
If you're running a persistent memory system that captures information from browser sessions, you know the data quality problem. Raw browser extractions include:
- Duplicate entries from revisiting the same page
- Outdated information from pages that changed since capture
- Fragments that are too short to be useful on their own
- Near-duplicates that say the same thing in slightly different words
- Genuinely valuable insights buried under noise
Manual cleanup doesn't scale. Once you have thousands of memories, you need an automated janitor.
The Classification Pipeline
The LLM-powered janitor processes memories in batches of 20-50 and classifies each one:
- Keep - the memory is accurate, unique, and useful
- Delete - duplicate, outdated, or too fragmentary to be valuable
- Merge - should be combined with another related memory
- Fix - contains useful information but needs correction or reformatting
The classification prompt includes context about existing memories so the LLM can identify duplicates and merge candidates.
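A minimal sketch of what one classification round might look like. The `Memory` record, the prompt wording, and the JSON response shape are all illustrative assumptions, not the actual implementation; the actual LLM call is omitted and only the prompt assembly and response validation are shown.

```python
import json
from dataclasses import dataclass

@dataclass
class Memory:
    id: str
    text: str

VALID_ACTIONS = {"keep", "delete", "merge", "fix"}

def build_batch_prompt(batch, existing_summaries):
    """Assemble one classification prompt for a batch of memories,
    including summaries of existing memories for duplicate detection."""
    lines = [
        "Classify each memory as keep, delete, merge, or fix.",
        "Existing memories (for duplicate/merge detection):",
        *[f"- {s}" for s in existing_summaries],
        "Memories to classify:",
        *[f"[{m.id}] {m.text}" for m in batch],
        'Reply as JSON: [{"id": "...", "action": "...", '
        '"merge_with": null, "fixed_text": null}]',
    ]
    return "\n".join(lines)

def parse_classifications(llm_response: str):
    """Parse the LLM's JSON reply, dropping entries with unknown actions."""
    return [item for item in json.loads(llm_response)
            if item.get("action") in VALID_ACTIONS]
```

Validating the parsed actions against a fixed set matters in practice: a single malformed LLM reply should degrade to "no decision" rather than delete memories it was never asked about.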
Self-Ranking via Hit Rate
The clever part is the self-ranking system. Each memory tracks a hit_rate - how often it gets retrieved and used in agent responses. Memories with high hit rates are clearly valuable. Memories that never get retrieved are candidates for deletion.
This creates a feedback loop:
- The janitor classifies memories
- The agent uses memories in its work
- Usage data updates hit rates
- The janitor uses hit rates to improve future classifications
Over time, your knowledge graph gets cleaner and more relevant without manual curation.
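The feedback loop above can be sketched as a small bookkeeping layer. The field names (`retrievals`, `uses`) and the thresholds are hypothetical choices for illustration; the key idea is that hit_rate is usage divided by retrieval opportunities, so a memory that is offered often but never used surfaces as a deletion candidate.

```python
from dataclasses import dataclass

@dataclass
class RankedMemory:
    id: str
    text: str
    retrievals: int = 0  # times this memory was surfaced to the agent
    uses: int = 0        # times it actually appeared in a response

    @property
    def hit_rate(self) -> float:
        # Fraction of retrievals that were actually used; 0.0 if never retrieved.
        return self.uses / self.retrievals if self.retrievals else 0.0

def record_retrieval(mem: RankedMemory, used: bool) -> None:
    """Update usage stats after the agent retrieves a memory."""
    mem.retrievals += 1
    if used:
        mem.uses += 1

def deletion_candidates(memories, min_retrievals=5, threshold=0.2):
    """Memories retrieved often enough to judge, but almost never used.
    min_retrievals avoids condemning memories that never had a chance."""
    return [m for m in memories
            if m.retrievals >= min_retrievals and m.hit_rate < threshold]
```

Requiring a minimum number of retrievals before judging a memory is the important design choice here: a memory with zero retrievals may simply be new, not useless.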
Batch Processing Matters
Running the janitor one memory at a time wastes tokens on repeated context. Batch processing lets the LLM see related memories together, making merge detection much more accurate. A batch size of 20-50 balances token efficiency with classification quality.
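A minimal batching helper along these lines. The optional sort key is an assumption of mine, not described in the post: grouping memories by something like source URL before chunking tends to land near-duplicates in the same batch, where the LLM can actually see them side by side.

```python
def make_batches(memories, size=30, key=None):
    """Chunk memories into batches of `size` (last batch may be smaller).
    Sorting by `key` first (e.g. source URL) keeps related memories
    together, which improves merge detection within each batch."""
    items = sorted(memories, key=key) if key else list(memories)
    return [items[i:i + size] for i in range(0, len(items), size)]
```

With the 20-50 range from the text, a `size` of 30 is a reasonable middle default; tune it against your model's context window and observed classification quality.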
The result is a memory system that maintains itself - getting more useful over time instead of degrading into noise.
Fazm is an open source macOS AI agent, available on GitHub.