Field notes on agent memory

AI agent memory management, and the case for keeping the whole transcript

Almost every guide on this subject sends you to a vector database. Embed each turn, summarize the rest, retrieve the fragments that look relevant. That is the right answer for a product with a million users. It is the wrong answer for an agent serving one person on one Mac, and this page is about the design that fits the smaller case.

Matthew Diakonov
11 min read

Direct answer, verified 2026-05-17

AI agent memory management is how an agent stores, maintains, and recalls information across turns and sessions. The standard treatment, captured well in IBM's overview of AI agent memory, describes it as a write-manage-read loop on top of a stateless model: a separate memory layer that embeds, summarizes, prunes, and retrieves. That design exists because the layer assumes the full history will never fit. For a single-user agent it does fit. The simpler design, the one this page argues for, is to write the whole conversation verbatim to one local database and never compact the durable copy.

The playbook everyone teaches

Open any current write-up on agent memory and the shape is the same. A language model is stateless: it does not remember the previous API call, so any continuity you see in ChatGPT or Claude is a memory system layered on top. Build agents of your own and you build that layer yourself. The layer is usually drawn as a hierarchy. Working memory is the papers on your desk, immediate but small. Short-term memory is the filing cabinet, recent and quick to reach. Long-term memory is the archive, vast and slower to search.

On top of that hierarchy sits the loop: write new observations, manage the store by pruning and compressing and consolidating, then read the relevant pieces back into context when the agent needs them. Long-term memory is almost always a vector database. Each turn gets chunked, each chunk gets embedded, the embeddings get searched by similarity, and a summarizer keeps the whole thing from growing without bound. Cloudflare, Redis, Databricks, and Letta all ship a version of this, and the framing is sound for the problem they are solving.

It is worth being precise about what makes that loop hard. Not the write. The write is easy. The hard parts are the manage step (what do you forget, what do you compress, and when) and the read step (given a question, which of ten thousand fragments do you pull). Both are hard for one reason only: the design has already decided that keeping the raw history is not an option. Every difficulty in mainstream agent memory management is downstream of that single assumption.

Where that assumption comes from

The retrieval pipeline was designed by and for teams running agents as a service. In that world the assumption that the transcript will not fit is correct. You have many thousands of users, each producing history every day. You may need an agent to recall something a different user established months ago. You run for years without a natural reset. At that scale, raw history genuinely does outgrow any context window and any reasonable storage budget, and you have no choice but to extract, compress, and retrieve.

A personal desktop agent breaks every one of those assumptions. There is one user. The history is one human typing and talking for hours, a few days, maybe a few weeks before the task is done and a new chat starts. Text is cheap: a long, tool-heavy conversation is still only megabytes. There is no cross-user recall problem because there is no other user. When the design constraints are this different, copying the SaaS playbook is not caution, it is overhead. You inherit the manage step and the read step, with all their failure modes, to solve a scaling problem you do not have.
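The "only megabytes" claim is easy to sanity-check with arithmetic. The sketch below is a back-of-envelope estimate, not a measurement from Fazm: the message count, average message size, and task length are illustrative assumptions you should replace with your own numbers.

```python
# Back-of-envelope transcript size. Every input here is an illustrative assumption.

def transcript_bytes(messages_per_day: int, avg_bytes_per_message: int, days: int) -> int:
    """Raw text size of a verbatim transcript, ignoring database overhead."""
    return messages_per_day * avg_bytes_per_message * days


# Assume a heavy user: 400 messages a day (tool output included),
# averaging 2,000 bytes each, over a three-week task.
size = transcript_bytes(messages_per_day=400, avg_bytes_per_message=2_000, days=21)
print(f"{size / 1_000_000:.1f} MB")  # prints "16.8 MB"
```

Even with generous inputs the total stays in the tens of megabytes, which is small for a local file.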

Two answers to the same question

Memory management for a service and memory management for one person's machine are not the same problem. Holding them side by side makes the mismatch obvious.

| Feature | Memory-layer playbook | Transcript-first (Fazm) |
| --- | --- | --- |
| Where memory lives | A vector database running as a separate service | One local SQLite file, fazm.db, on the user's machine |
| Write step | Chunk the turn, embed each chunk, summarize, upsert | INSERT the message text verbatim, one row |
| Recall step | Approximate nearest-neighbor search over embeddings | Load the rows back by conversation, or FTS5 keyword match |
| What can be lost | Anything a chunker, summarizer, or retriever skips | Nothing. The transcript is the record |
| Worst failure mode | Wrong fragment retrieved; agent acts on a half-memory | Disk full or DB corruption: loud, detected, recoverable |
| Scales comfortably to | Millions of users, years of accumulated history | One person's machine, a transcript measured in megabytes |

The last row is the honest catch: transcript-first does not scale to a multi-tenant backend. That is the boundary, not a flaw. Match the design to the scale you actually run at.

What Fazm actually stores

Fazm is a native macOS app that wraps Claude Code and Codex. Its durable memory is not a service and not a vector index. It is one SQLite file, fazm.db, written to ~/Library/Application Support/Fazm/users/<uid>/fazm.db, and inside it one table holds every message you have ever exchanged: chat_messages.

The table has a small history of its own, and it is worth tracing because it shows the design holding steady. Migration fazmV2 created it as task_chat_messages for onboarding message persistence. fazmV3 renamed it to chat_messages once it became the generic store for all conversations. fazmV5 added a session_id column to separate conversations and an FTS5 virtual table, chat_messages_fts, for fast keyword search. Across five migrations the durable store never grew a summarizer, a pruner, or an embedding column. It stayed a flat log of rows. You can read all of this in Desktop/Sources/AppDatabase.swift in the public repository.

Because the file is plain SQLite, you do not have to take any of this on faith. Open it yourself.

reading your own agent memory
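One way to do that from a script is sketched here. It uses only Python's standard sqlite3 module and is not part of Fazm; describe_table is a hypothetical helper name, and the database path is the one given earlier, with <uid> left for you to fill in. The file is opened read-only so the script cannot modify your history.

```python
import sqlite3
from pathlib import Path


def describe_table(db_path: str, table: str) -> list[tuple[str, str, object]]:
    """Return (column name, declared type, default value) for each column."""
    # Build a read-only URI; as_uri() percent-encodes spaces in the path.
    uri = Path(db_path).expanduser().resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        # PRAGMA takes no bound parameters, so only pass a trusted table name.
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    finally:
        conn.close()
    return [(name, col_type, default) for _, name, col_type, _, default, _ in rows]


# Example call (replace <uid> with your own user directory):
# for column in describe_table(
#     "~/Library/Application Support/Fazm/users/<uid>/fazm.db", "chat_messages"
# ):
#     print(column)
```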

Eight columns, no embeddings, no summary blobs. The backendSynced default of 0 is the tell: the local row is the source of truth, and the agent works whether or not anything is ever synced to a server. Search runs through chat_messages_fts as exact full-text matching, so you find the message you wrote, not the nearest vector to it.

The write path, side by side

The clearest way to see the difference is the code that runs when a single turn is committed to memory. On the left, the shape of a typical memory-layer write: chunk, embed, summarize, upsert, with two network round trips before the data is even stored. On the right, the actual write path from Desktop/Sources/ChatMessageStore.swift: one SQL statement, no network, no model calls.

Committing one turn to memory

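# Typical "memory layer" write path (illustrative pseudocode: split_into_chunks,
# embed, summarize, store, now, turn, and user_id stand in for your own stack).
from uuid import uuid4

# The loop variable is named `piece` so it does not shadow the chunking function.
for piece in split_into_chunks(turn.text):
    vec = embed(piece)              # network call
    summary = summarize(piece)      # network call
    store.upsert(
        id=str(uuid4()),
        vector=vec,
        metadata={
            "user": user_id,
            "ts": now(),
            "summary": summary,
        },
    )
# and later: prune, re-embed, consolidate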

The right side cannot retrieve the wrong thing, because it never decided what to keep. It cannot summarize away a decision you made forty turns ago, because it never summarizes. The INSERT OR REPLACE means a streamed message that updates as tokens arrive overwrites its own row cleanly rather than duplicating. That is the whole write path. Memory management, on this design, is mostly the discipline of not adding machinery you do not need.
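To make the INSERT OR REPLACE behavior concrete, here is a minimal Python sketch of a transcript-first write. It is not Fazm's Swift code from ChatMessageStore.swift. The schema is a simplified assumption: only messageText, session_id, and backendSynced are named in this article, while id, sender, and createdAt are hypothetical columns added for illustration.

```python
import sqlite3
import time

# Simplified, assumed schema; not the real fazm.db definition.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chat_messages (
    id            TEXT PRIMARY KEY,
    session_id    TEXT,
    sender        TEXT,
    messageText   TEXT NOT NULL,
    createdAt     REAL,
    backendSynced INTEGER DEFAULT 0
)
"""


def save_message(conn, message_id, session_id, sender, text):
    """Write one message verbatim. Saving the same ID again overwrites the row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO chat_messages "
            "(id, session_id, sender, messageText, createdAt) VALUES (?, ?, ?, ?, ?)",
            (message_id, session_id, sender, text, time.time()),
        )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# A streamed reply is saved repeatedly under one ID as tokens arrive.
save_message(conn, "m1", "s1", "assistant", "The migration")
save_message(conn, "m1", "s1", "assistant", "The migration adds an FTS5 table.")

row_count = conn.execute("SELECT COUNT(*) FROM chat_messages").fetchone()[0]
final_text = conn.execute("SELECT messageText FROM chat_messages WHERE id = 'm1'").fetchone()[0]
print(row_count, final_text)  # prints "1 The migration adds an FTS5 table."
```

One design detail worth knowing: in SQLite, REPLACE deletes the conflicting row and inserts a new one, so any column not listed in the statement falls back to its default on every save.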

Write, manage, read, with the manage step removed

Keep the standard loop as the frame and see what happens to each step. Write becomes the single INSERT above, called once per message from ChatMessageStore.saveMessage. Nothing is chunked, embedded, or summarized on the way in.

Manage nearly disappears. There is no pruning, because rows are cheap and the conversation is finite. There is no consolidation, because nothing was ever split apart to need rejoining. The only thing that resembles management is an append-only chain of upstream session IDs, which exists not to shrink memory but to keep older rows reachable after a session rolls over. That mechanism has its own write-up, and it is additive, never destructive.

Read has two forms. The common one is to load the rows straight back by conversation, in order, which is what restoring a window does. The other is keyword search through the FTS5 table when you want to find an old thread. Neither form is approximate. Neither can return a fragment that the model never actually saw. The expensive, error-prone middle of the mainstream loop is simply gone, and the two ends are deterministic.
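Both read forms fit in a few lines of SQL. The sketch below is an assumption-laden illustration in Python, not Fazm's code: the schema is simplified, the trigger names are invented, and it requires a SQLite build with FTS5 enabled (common, but not guaranteed). The trigger pattern is the standard one for an external-content FTS5 table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chat_messages (
    id          INTEGER PRIMARY KEY,
    session_id  TEXT,
    messageText TEXT NOT NULL
);
-- Full-text index over messageText, kept in sync by triggers.
CREATE VIRTUAL TABLE chat_messages_fts USING fts5(
    messageText, content='chat_messages', content_rowid='id'
);
CREATE TRIGGER chat_messages_ai AFTER INSERT ON chat_messages BEGIN
    INSERT INTO chat_messages_fts(rowid, messageText) VALUES (new.id, new.messageText);
END;
CREATE TRIGGER chat_messages_au AFTER UPDATE ON chat_messages BEGIN
    INSERT INTO chat_messages_fts(chat_messages_fts, rowid, messageText)
        VALUES ('delete', old.id, old.messageText);
    INSERT INTO chat_messages_fts(rowid, messageText) VALUES (new.id, new.messageText);
END;
""")

conn.executemany(
    "INSERT INTO chat_messages (session_id, messageText) VALUES (?, ?)",
    [
        ("s1", "Add a rollback plan to the deploy script."),
        ("s1", "Done, the script now snapshots first."),
        ("s2", "Unrelated note about fonts."),
    ],
)


def load_conversation(conn, session_id):
    """Read form 1: every message of one conversation, in insertion order."""
    rows = conn.execute(
        "SELECT messageText FROM chat_messages WHERE session_id = ? ORDER BY id",
        (session_id,),
    )
    return [text for (text,) in rows]


def search(conn, query):
    """Read form 2: keyword match over the literal transcript via FTS5."""
    rows = conn.execute(
        "SELECT session_id, messageText FROM chat_messages WHERE id IN "
        "(SELECT rowid FROM chat_messages_fts WHERE chat_messages_fts MATCH ?) "
        "ORDER BY id",
        (query,),
    )
    return rows.fetchall()


print(load_conversation(conn, "s1"))
print(search(conn, "rollback"))
```

Both functions return rows that were written verbatim, so there is no ranking threshold or similarity cutoff to tune.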

Two memory layers, and which one is allowed to forget

"Never compact" needs one honest qualification. There are two distinct layers, and only one of them is durable. Conflating them is where most confusion about agent memory starts.

Working memory

The live model context window

Ephemeral. The Claude Code SDK underneath Fazm can auto-compact this window when it is about to overflow. Fazm adds no compaction of its own, and it surfaces the SDK's compaction as a visible compact_boundary event with the trigger and pre-compaction token count, so a summary never swaps in silently.

Durable memory

The fazm.db transcript

Permanent for the conversation's life. Every message is already on disk, verbatim, the moment it is sent. Whatever the live window does, the durable record is complete. A restart, a fork, or a session rollover all rebuild from this layer.

The mainstream playbook tries to make one store do both jobs, which is why it has to compress: the thing it retrieves from is also the thing it feeds the model. Splitting the two means the durable layer is free to be a dumb, complete log, and the lossy behavior is confined to the live window where it belongs and where you can see it happen.
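The split can be modeled in a few lines. This is a toy Python sketch of the idea, not the Claude Code SDK or Fazm: the TwoLayerMemory class, the word-count tokenizer, the placeholder summary, and the event field names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TwoLayerMemory:
    window_budget: int                                  # max tokens in the live window
    durable: list[str] = field(default_factory=list)    # complete log, never compacted
    window: list[str] = field(default_factory=list)     # live context, allowed to be lossy
    events: list[dict] = field(default_factory=list)    # visible record of each compaction

    @staticmethod
    def tokens(text: str) -> int:
        return len(text.split())  # crude stand-in for a real tokenizer

    def add(self, message: str) -> None:
        self.durable.append(message)  # the durable write happens first, verbatim
        self.window.append(message)
        used = sum(self.tokens(m) for m in self.window)
        if used > self.window_budget:
            # Record the compaction so it never happens silently.
            self.events.append(
                {"type": "compact_boundary", "trigger": "auto", "pre_tokens": used}
            )
            # Stand-in for a model-written summary of the older messages.
            older = len(self.window) - 1
            self.window = [f"[summary of {older} earlier messages]", message]


mem = TwoLayerMemory(window_budget=8)
for msg in ["plan the release", "draft the changelog", "tag version two today"]:
    mem.add(msg)

print(len(mem.durable), len(mem.window), mem.events)
```

After the third message the live window has been compacted to two entries, while the durable list still holds all three messages unchanged.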

The three places memory normally breaks

A desktop agent loses memory in three concrete moments, and a transcript-first design has a plain answer for each.

Restart. Quit the app, or reboot the Mac. Because every message was already written to fazm.db, the durable record survives the process dying. On launch each window restores from the file. The floating bar restores its most recent 50 messages, set by the floatingRestoreLimit constant; detached windows reload their full feed keyed by conversation.

Fork. Branching a chat calls session/fork. The new branch inherits the entire prior conversation as context, and the source session stays alive on disk, reachable through Conversation History. Neither side is destroyed, and nothing is copied or re-embedded, because the transcript is append-only and keyed by conversation.

Session rollover. The upstream model session is a transient handle. A rate limit, a credit cap, or a bridge restart can invalidate it, and the SDK answers with a fresh session ID. When that happens Fazm replays the recent history as a preamble, capped at the last 20 turns by the MAX_REPLAY constant in the bridge, and it spans the session-ID chain so the replay sees messages from before the rollover, not just after it. The model picks the thread back up instead of waking as a stranger mid-task.
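The rollover replay is the least obvious of the three, so here is a minimal Python sketch of the query it implies. It is an illustration under assumptions, not the bridge's TypeScript: the schema is simplified, replay_preamble is a hypothetical name, and one stored message is treated as one turn. Only the MAX_REPLAY value of 20 comes from this article.

```python
import sqlite3

MAX_REPLAY = 20  # cap on replayed turns, as described above


def replay_preamble(conn, session_chain, limit=MAX_REPLAY):
    """Newest `limit` messages across every session ID in the chain, oldest first."""
    placeholders = ",".join("?" for _ in session_chain)
    rows = conn.execute(
        f"SELECT sender, messageText FROM chat_messages "
        f"WHERE session_id IN ({placeholders}) ORDER BY id DESC LIMIT ?",
        (*session_chain, limit),
    ).fetchall()
    return list(reversed(rows))  # restore chronological order for the model


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chat_messages "
    "(id INTEGER PRIMARY KEY, session_id TEXT, sender TEXT, messageText TEXT)"
)
# 15 messages before the rollover, 10 after it.
rows = [("s-old", "user", f"old {i}") for i in range(1, 16)]
rows += [("s-new", "user", f"new {i}") for i in range(1, 11)]
conn.executemany(
    "INSERT INTO chat_messages (session_id, sender, messageText) VALUES (?, ?, ?)", rows
)

chained = replay_preamble(conn, ["s-old", "s-new"])
unchained = replay_preamble(conn, ["s-new"])
print(len(chained), chained[0][1], chained[-1][1])  # prints "20 old 6 new 10"
print(len(unchained))                               # prints "10"
```

Querying only the newest session ID would replay 10 messages and drop everything from before the rollover; spanning the chain fills the cap with the 20 most recent messages regardless of which session wrote them.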

When the retrieval pipeline is still the right call

This is not an argument that vector databases are wrong. It is an argument that they answer a question most personal agents are not asking. There are real cases where the full memory-layer machinery is the correct answer, and it is worth naming them so the boundary is clear.

If you run a multi-tenant product, you cannot keep every user's raw transcript live and queryable forever; extraction and retrieval are how you stay within a storage and latency budget. If your agent must carry knowledge between sessions that never overlap, for example recalling a fact a user gave it three months ago without replaying three months of chat, you need a store that holds distilled facts, not raw turns. If an agent runs unattended for months, the transcript really does outgrow any window, and a managed service like Cloudflare Agent Memory that extracts and recalls without filling the context is a reasonable buy. And if you need semantic recall, finding related ideas that share no exact words, keyword search will not do it and embeddings will.

The honest framing is a fork in the road, not a winner. Transcript on one branch, retrieval pipeline on the other. The mistake the common advice makes is pretending there is only one road.

How to choose for your own agent

The deciding question is not technical taste, it is scale. Ask whether the complete history of a single conversation comfortably fits in storage and in a context window over the agent's working life. For a personal agent the answer is almost always yes, and the transcript-first design wins on the axes that matter at that scale: it is lossless, it has no retrieval-relevance bug, it is auditable with the sqlite3 CLI, and its write path is faster because no embedding step sits on it.

If the answer is no, you are at SaaS scale, and you should build the layer the mainstream guides describe, with eyes open about the manage and read steps you are taking on. What you should not do is reach for a vector database on day one for an agent that serves you and only you. That is solving a problem you do not have and importing failure modes you did not need.

Fazm is the worked example of the smaller answer. The agent loop is the real Claude Code; the memory is one SQLite file you can open and read. It is free to start, runs locally, and the full source, including every line referenced here, is on GitHub.


AI agent memory management: common questions

What is AI agent memory management, in one sentence?

It is how an agent stores, maintains, and recalls information across turns and sessions, usually described as a write-manage-read loop: new information is written, the store is maintained (pruned, compressed, consolidated), and relevant pieces are read back into the model context. The hard parts (what to forget, what to compress, which fragment to retrieve) only exist once you decide the transcript is too large to keep whole. For a single-user agent it usually is not.

Do I need a vector database for agent memory?

Not for a personal agent. A vector database earns its place when you have many users, years of history, or knowledge that must be shared across sessions that never overlap in time. For one person on one machine, a conversation history is measured in megabytes of text. Storing it verbatim in a local SQLite file and loading it back is faster, lossless, and has no retrieval-relevance failure mode. Fazm stores every message in a single chat_messages table and never embeds anything.

What happens to my conversation when I restart the Mac?

Nothing is lost. Every message is written to the local fazm.db file as it is sent, so the durable record exists independent of whether the app is running. On launch Fazm restores each window from that file. For the floating bar the restore is capped at the most recent 50 messages (the floatingRestoreLimit constant in ChatProvider.swift); detached windows reload their full feed by taskId. The upstream model session is resumed separately, and if that resume fails the recent history is replayed as a preamble so the agent picks the thread back up instead of waking as a stranger.

Does Fazm ever compact or summarize my history?

The durable copy, never. The live model context window is a separate layer, and the Claude Code SDK underneath can still auto-compact that window when it is about to overflow. Fazm does not add its own compaction on top, and it surfaces the SDK's compaction as a visible compact_boundary event (handled in acp-bridge/src/index.ts) carrying the trigger and the pre-compaction token count, so a summary never replaces your history silently. Either way the SQLite transcript stays whole.

Where is the conversation data stored, and is it sent anywhere?

It lives in ~/Library/Application Support/Fazm/users/<your-uid>/fazm.db, a plain SQLite database on your machine. Messages are written with backendSynced set to 0, meaning the local row is the source of truth and no server copy is required for the agent to function. You can open the file with the sqlite3 CLI and read your own history directly.

How does forking a chat affect memory?

Forking calls session/fork on the bridge. The branch keeps the full prior conversation as context, and the source session stays alive on disk and remains reachable through Conversation History, so neither branch is destroyed. Because the persisted transcript is append-only and keyed by conversation, a fork is cheap: it does not copy or re-embed anything, it just starts a new branch that already sees everything that came before it.

Can I search across old conversations?

Yes. Migration fazmV5 adds an FTS5 virtual table, chat_messages_fts, that indexes messageText and stays in sync through insert and update triggers. Search is exact keyword full-text matching over the literal transcript, not approximate vector similarity. You get the message you actually wrote, not the nearest embedding to it.

When should I build a real memory layer instead?

When your scale breaks the assumption that the transcript fits. Multi-tenant products, agents that run unattended for months, or systems that must carry knowledge between users all need extraction, compression, and retrieval, because keeping every raw turn forever stops being practical. The transcript-first design is right for a personal desktop agent and wrong for a large SaaS backend. Pick the one that matches your scale, not the one that is fashionable.
