AI Agent Memory Architecture: Three Layers That Matter
Three memory layers, three different characters. An AI agent's working memory is frantic and detailed, holding every token of the current conversation. Its session summaries are editorial, compressing hours of work into key takeaways. Its long-term memory is factual, storing verified truths that persist across sessions. Designing these layers well is the difference between an agent that forgets everything between conversations and one that builds genuine understanding over time. This guide covers the practical architecture of agent memory systems.
1. Why Memory Matters for AI Agents
A stateless AI agent is like an employee who forgets everything the moment they leave the office. Every morning, you have to re-explain the project, the codebase conventions, the client preferences, and the decisions from yesterday. This is the default state of most AI tools: each conversation starts from zero.
Memory transforms an agent from a stateless tool into something closer to a colleague. An agent with good memory remembers that you prefer TypeScript over JavaScript, that the production database is on the us-east-1 cluster, and that the last deployment had a caching bug that was fixed by invalidating the CDN cache. These facts are not in any documentation. They live in the accumulated experience of working on the project.
The challenge is that "memory" is not a single thing. Different types of information need different storage strategies, different retrieval mechanisms, and different retention policies. Treating all memory the same leads to either context window bloat (storing everything in the prompt) or information loss (storing nothing between sessions).
The three-layer model provides a practical framework: working memory for the current session, session summaries for recent history, and long-term facts for persistent knowledge. Each layer has different characteristics in terms of detail, longevity, and retrieval cost.
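The three layers can be sketched as simple data structures. This is a minimal illustration, not a prescribed schema; the class and field names (`WorkingMemory`, `SessionSummary`, `LongTermFacts`) are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    """Layer 1: full-fidelity transcript of the current session."""
    messages: list = field(default_factory=list)  # every message and tool call

@dataclass
class SessionSummary:
    """Layer 2: compressed record of one finished session."""
    task: str
    outcome: str        # e.g. "completed", "partial", or "blocked"
    decisions: list = field(default_factory=list)
    timestamp: float = 0.0

@dataclass
class LongTermFacts:
    """Layer 3: deduplicated, persistent knowledge keyed by name."""
    facts: dict = field(default_factory=dict)  # fact name -> verified value
```

Each class mirrors one layer's character: the transcript keeps everything, the summary keeps structured highlights, and the fact store keeps one entry per piece of knowledge.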
2. Layer 1: Working Memory
Working memory is the conversation context window. It contains the full transcript of the current session: every message, every tool call, every response. It is high-fidelity, capturing exact details, code snippets, error messages, and decision rationale.
The primary constraint on working memory is the context window size. Modern models support 100K to 200K tokens, which is substantial but finite. A complex coding session with many file reads and tool calls can consume this budget in an hour or two. When the context window fills up, the agent must either truncate older messages or lose the ability to accept new input.
Effective working memory management involves several techniques. First, be selective about what enters the context. Loading entire files when you only need a function is wasteful. Second, summarize tool outputs rather than including raw results. A search that returns 50 results does not need all 50 in the context; the top 5 with a summary is usually sufficient. Third, use structured formats. JSON or markdown tables are more token-efficient than prose for representing structured data.
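The second technique, summarizing tool outputs, can be as simple as truncating the result list and noting what was dropped. A minimal sketch (the function name `compress_tool_output` is an assumption, not an established API):

```python
def compress_tool_output(results, keep=5):
    """Keep the top `keep` results and replace the rest with a one-line summary,
    so the context window holds a digest instead of the raw tool output."""
    kept = list(results[:keep])
    dropped = len(results) - len(kept)
    if dropped > 0:
        kept.append(f"... and {dropped} more results omitted")
    return kept
```

A 50-result search collapses to six lines: the top five hits plus a note that 45 more exist, which the agent can ask for if needed.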
For desktop automation agents, working memory includes the current state of the applications being controlled. Which windows are open? What is the content of the active document? What menu items are available? Tools like Fazm that use accessibility APIs can query this state on demand rather than keeping a stale snapshot in memory, which means the working memory reflects the actual current state of the desktop rather than an outdated representation.
3. Layer 2: Session Summaries
When a session ends, the working memory needs to be compressed into a summary that captures what happened, what was decided, and what remains to be done. This is the session summary layer. It trades detail for longevity.
A good session summary is structured, not a prose paragraph. It should contain: the task that was attempted, the outcome (completed, partially completed, blocked), key decisions made and their reasoning, files that were modified, and any open questions or follow-up tasks. This structured format makes summaries searchable and parseable by future agent sessions.
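One way to enforce that structure is to render the summary from a fixed set of fields. A hedged sketch (the dict keys and the `render_session_summary` helper are illustrative choices, not a standard format):

```python
def render_session_summary(summary: dict) -> str:
    """Render a structured summary dict into a compact, parseable text block
    with a fixed set of sections: task, outcome, decisions, files, questions."""
    lines = [
        f"## Session: {summary['task']}",
        f"Outcome: {summary['outcome']}",   # completed / partial / blocked
        "Decisions:",
    ]
    lines += [f"- {d['choice']}: {d['reason']}" for d in summary["decisions"]]
    lines.append("Files modified: " + ", ".join(summary["files_modified"]))
    lines.append("Open questions:")
    lines += [f"- {q}" for q in summary["open_questions"]]
    return "\n".join(lines)
```

Because every summary has the same sections, a future session can parse out just the open questions or just the modified files without re-reading the whole record.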
The compression ratio matters. A 2-hour session with 50K tokens of working memory should compress to 500 to 1,000 tokens of summary. This 50x to 100x compression means you can fit summaries from dozens of recent sessions into the context window of a new session, giving the agent a broad view of recent project history.
Session summaries should be generated automatically at the end of each session. Have the agent itself write the summary as its final action, while the full context is still available. Alternatively, use a separate summarization step that processes the session transcript after the fact. The first approach is simpler; the second allows for more sophisticated summarization with access to external context.
Store summaries in a format that supports both chronological browsing (what happened recently?) and semantic search (when did we last discuss the caching issue?). A SQLite database with a text column for the summary and a timestamp works well for chronological access. For semantic search, embed the summaries with a text embedding model and store the vectors alongside the text.
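The chronological half of this is a few lines of SQLite. A minimal sketch of the store described above (table and function names are assumptions; the embedding side is omitted):

```python
import sqlite3
import time

def init_store(path=":memory:"):
    """Create the summaries table: one text column plus a timestamp."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS summaries (
        id INTEGER PRIMARY KEY,
        created_at REAL NOT NULL,
        summary TEXT NOT NULL)""")
    return conn

def save_summary(conn, text, when=None):
    """Insert one session summary, stamped with the current time by default."""
    conn.execute("INSERT INTO summaries (created_at, summary) VALUES (?, ?)",
                 (when if when is not None else time.time(), text))
    conn.commit()

def recent_summaries(conn, n=5):
    """Chronological browsing: the n most recent summaries, newest first."""
    rows = conn.execute(
        "SELECT summary FROM summaries ORDER BY created_at DESC LIMIT ?", (n,))
    return [r[0] for r in rows]
```

For the semantic-search half, a vector column (or a companion table of embeddings) alongside this one is enough at small scale; a dedicated vector store only becomes worthwhile with thousands of sessions.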
4. Layer 3: Long-Term Facts
Long-term facts are verified, persistent pieces of knowledge that remain true across sessions. They are extracted from session summaries and working memory but live independently. Examples include: "The project uses Next.js 14 with the app router," "The client prefers British English spelling," or "The staging environment URL is staging.example.com."
The key distinction between long-term facts and session summaries is that facts are deduplicated and updated. If three different sessions mention the database URL, the long-term memory should contain one entry for the database URL, not three. If the URL changes, the fact should be updated, not appended.
Fact extraction requires care. Not everything that comes up in a session is a long-term fact. "The build failed because of a typo in line 42" is a session event, not a persistent fact. "The project requires Node 20 or higher" is a persistent fact. The distinction is whether the information is likely to be relevant in future sessions that have no other context about this specific event.
Implement long-term facts as a key-value store with categories. Categories might include: project setup, coding conventions, team preferences, environment details, and known issues. When starting a new session, load the relevant category of facts into the context. A coding session loads project setup and coding conventions. A deployment session loads environment details.
Facts need a confidence score and a last-verified timestamp. Information decays. A fact recorded six months ago might no longer be true. When an agent encounters information that contradicts a stored fact, it should flag the discrepancy rather than silently using potentially stale data.
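A categorized fact store with confidence and last-verified timestamps fits in one small class. This is a sketch under the assumptions above (the `FactStore` class and its method names are hypothetical, not a library API):

```python
import time

class FactStore:
    """Categorized key-value store for long-term facts, with a confidence
    score and last-verified timestamp per entry."""

    def __init__(self):
        # (category, key) -> {"value", "confidence", "verified_at"}
        self._facts = {}

    def set(self, category, key, value, confidence=1.0):
        # Overwrites any existing entry: facts are deduplicated and
        # updated in place, never appended.
        self._facts[(category, key)] = {
            "value": value,
            "confidence": confidence,
            "verified_at": time.time(),
        }

    def get(self, category, key):
        entry = self._facts.get((category, key))
        return entry["value"] if entry else None

    def load_category(self, category):
        """Facts to prime a new session with, for one category
        (e.g. 'environment' for a deployment session)."""
        return {k: e["value"] for (c, k), e in self._facts.items()
                if c == category}
```

Because `set` overwrites, recording the database URL in three sessions still leaves exactly one entry, and a changed URL replaces the old value rather than sitting beside it.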
5. Putting It Together
A practical implementation uses three storage backends. Working memory lives in the model's context window (managed by the agent framework). Session summaries live in a SQLite database or JSON files, one per session. Long-term facts live in a structured file (like CLAUDE.md) or a small database that is loaded at session start.
The flow between layers is: during a session, everything goes into working memory. At session end, the working memory is compressed into a session summary and saved. Periodically (or at each session end), new long-term facts are extracted from recent session summaries and merged into the fact store.
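The end-of-session flow can be sketched as one function. In practice `summarize` and `extract_facts` would be model calls; here they are injected as plain callables, and all names are illustrative:

```python
def end_session(working_memory, summarize, extract_facts,
                summary_store, fact_store):
    """Compress working memory into a summary, persist it, then merge
    any newly extracted facts into the long-term store."""
    summary = summarize(working_memory)       # model call in a real system
    summary_store.append(summary)             # layer 2: save the summary
    for category, key, value in extract_facts(summary):
        fact_store[(category, key)] = value   # layer 3: merge, don't append
    return summary
```

The merge step is the important part: extracted facts overwrite existing keys, which keeps the fact store deduplicated as sessions accumulate.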
At session start, the agent's context is primed with: the full long-term fact store (usually small enough to fit easily), summaries from the N most recent sessions, and any session summaries that are semantically relevant to the current task (retrieved via embedding search). This gives the agent broad persistent knowledge, recent context, and task-specific history.
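Priming is then a matter of concatenating the three sources, deduplicating summaries that are both recent and relevant. A minimal sketch (the `prime_context` name and section headers are assumptions; embedding retrieval happens upstream):

```python
def prime_context(fact_store: dict, recent: list, relevant: list,
                  max_recent=5) -> str:
    """Assemble the starting context for a new session: all long-term
    facts, the most recent summaries, and task-relevant summaries."""
    parts = ["# Long-term facts"]
    parts += [f"- {k}: {v}" for k, v in fact_store.items()]
    parts.append("# Recent sessions")
    kept_recent = recent[:max_recent]
    parts += kept_recent
    parts.append("# Relevant history")
    # skip summaries already included in the recent block
    parts += [s for s in relevant if s not in kept_recent]
    return "\n".join(parts)
```

The three sections correspond directly to broad persistent knowledge, recent context, and task-specific history as described above.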
For desktop automation agents, memory architecture enables powerful patterns. An agent that remembers your typical workflow (open these three apps, arrange windows this way, export data in this format) can execute future tasks more efficiently. Fazm and similar tools can benefit from remembering which accessibility tree paths correspond to frequently used UI elements, reducing the exploration needed in future sessions.
The key principle is: store at the right granularity for each layer. Working memory needs full detail. Session summaries need structured highlights. Long-term facts need verified, deduplicated knowledge. Getting this balance right means your agents get smarter over time without drowning in irrelevant context.