Local LLM workflow literacy, the five primitives that turn a chatbox into work
Most guides on local models stop at “you have a chatbot running.” Most guides on AI literacy never touch a local model at all. The gap between those two is where almost all of the actual work happens, and it has a shape: five concrete primitives you need to be operationally fluent with before a local LLM stops feeling like a toy. Each one below is anchored to a specific file in the open source Fazm desktop app so you can verify it instead of taking my word for it.
Direct answer (verified 2026-05-05)
Local LLM workflow literacy is fluency in five operational primitives: the agent loop versus a single prompt, the screen-state representation choice, the swappable model endpoint, skills as durable workflow units on disk, and persistent memory files reloaded at session start. The whole page below is one paragraph per primitive plus a self-check.
Seventeen. The exact number of .skill.md files Fazm bundles in Desktop/Sources/BundledSkills/ and auto-installs to ~/.claude/skills/. The Observer creates new ones only after a workflow repeats three or more times.
github.com/m13v/fazm, ChatPrompts.swift line 607
Primitive 1 of 5
The agent loop, not the prompt
A chat is one round trip. A model produces an answer to one input and the conversation moves on. An agent is a loop. On every turn the runtime concatenates the system prompt, the full tool schema, the conversation so far, and a fresh dump of the world state, then the model emits a tool call, the runtime executes the tool, appends the result, and goes around again. The same model behind the same chat box can feel snappy in one shape and unusable in the other, because the loop pays the input cost on every iteration.
The literacy bit is internalising that the unit of thought is no longer a prompt but a turn shape. When something feels slow, ask what is in the input on each turn. When something feels stupid, ask what was missing from the input on the turn it failed. When something feels expensive, count the tokens that are repeating every iteration unchanged. Most surprises about local agents evaporate once you start thinking in turns.
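To make the turn shape concrete, here is a minimal sketch of the loop in Swift. Every name in it (callModel, executeTool, the stubbed tool schema) is illustrative, not Fazm's actual API; the point is what gets re-sent on every iteration.

```swift
import Foundation

// Sketch of one agent loop; all helpers are illustrative stubs,
// not Fazm's actual API.
struct Turn { let role: String; let content: String }
struct ToolCall { let name: String; let args: String }

let systemPrompt = "You are an agent. Emit one tool call per turn."
let toolSchema = "tools: click(x,y), type(text), done()"

func captureScreenState() -> String { "window: Finder ..." }   // stub: fresh world-state dump
func callModel(_ input: String) -> String { "done()" }         // stub: would hit the model endpoint
func parseToolCall(_ reply: String) -> ToolCall? {
    reply.hasPrefix("done") ? nil : ToolCall(name: reply, args: "")
}
func executeTool(_ call: ToolCall) -> String { "ok" }          // stub: would drive the UI

func runAgentLoop(task: String, maxTurns: Int = 20) {
    var history = [Turn(role: "user", content: task)]
    for _ in 0..<maxTurns {
        // Every turn re-sends the fixed prefix plus the growing history
        // plus a fresh screen-state dump -- that repeated input is the cost.
        let input = ([systemPrompt, toolSchema, captureScreenState()]
            + history.map { "\($0.role): \($0.content)" }).joined(separator: "\n")
        let reply = callModel(input)
        guard let call = parseToolCall(reply) else { return }  // no tool call, task done
        let result = executeTool(call)
        history.append(Turn(role: "assistant", content: reply))
        history.append(Turn(role: "tool", content: result))
    }
}
```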
Primitive 2 of 5
Screen-state representation is the variable that decides everything
If the agent has to act on what is on screen, every turn carries a representation of the screen. There are two viable choices and the gap between them is enormous. A screenshot fed to a vision model lands somewhere in the 1,500 to 3,000 input-token range depending on resolution and tile policy. A compact accessibility tree of the same window, where each element becomes a tiny record of role, text, and bounding box, lands in the 200 to 400 token range. Both describe the same window. One is six times more expensive on every turn.
On a frontier cloud model with thousands of tokens per second of prefill you can ignore this. On a local 13B at hundreds of tokens per second of prefill you cannot. The downstream effect is that an accessibility-tree agent on consumer Apple Silicon finishes a 10-step task while a screenshot-based agent on the same hardware is still on step three. Knowing this is what separates “my local model is too dumb” from “my representation is too fat.”
Same window, two representations, one big number
The screenshot route: render the active window as a PNG and hand it to a vision model. Easy to set up, works on any app, and is the default in most computer-use research demos.
- 1,500 to 3,000 input tokens per turn
- Vision model required for every reasoning step
- 150,000 cumulative tokens over a 10-step task
- Loop dominated by prefill on consumer Apple Silicon
The accessibility-tree route: walk the window's element hierarchy and emit one compact record of role, text, and bounding box per interactive element. It needs platform accessibility APIs, but it describes the same window in a fraction of the tokens.
- 200 to 400 input tokens per turn
- Works with a text-only reasoner, no vision model in the loop
- Roughly 25,000 cumulative tokens over the same 10-step task
- Prefill stops dominating the loop on consumer Apple Silicon
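To see why the compact form is so much cheaper, here is a sketch of an element record and a rough token estimate. The field names and the four-characters-per-token rule of thumb are assumptions for illustration, not MacosUseSDK's actual traversal format.

```swift
// Sketch of the compact representation: one small record per
// interactive element instead of a full-resolution screenshot.
struct AXRecord {
    let role: String                     // e.g. "button", "textfield"
    let text: String                     // visible label or value
    let x: Int, y: Int, w: Int, h: Int   // bounding box
}

func renderScreenState(_ elements: [AXRecord]) -> String {
    elements.map { "\($0.role) \"\($0.text)\" @\($0.x),\($0.y) \($0.w)x\($0.h)" }
        .joined(separator: "\n")
}

let window = [
    AXRecord(role: "button", text: "Reply", x: 24, y: 700, w: 80, h: 28),
    AXRecord(role: "textfield", text: "Search mail", x: 300, y: 12, w: 240, h: 24),
]
let dump = renderScreenState(window)
print(dump)
print("approx tokens:", dump.count / 4)  // a few dozen elements lands around 200-400 tokens
```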
Primitive 3 of 5
The reasoner is a UserDefault, not a recompile
The agent process and the model behind it are separate things. The agent owns the loop, the tool schema, the screen-state capture, and the rules about when to stop. The model only has to read the input and emit the next tool call. Treat them as one and you will rebuild your stack every time you swap models. Treat them as separate and a single line of config flips you from cloud to local with the same skills, the same memory, and the same workflow.
In Fazm this is exactly one @AppStorage entry called customApiEndpoint, declared at SettingsPage.swift line 885 and surfaced in the Settings panel as a single text field. At session start ACPBridge.swift reads that value and exports it as ANTHROPIC_BASE_URL on the spawned bridge process. The bridge does not know whether the URL points at api.anthropic.com, an OpenRouter Anthropic endpoint, a corporate proxy, or a llama.cpp server with a small shim in front of it. That is the boundary you want.
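The pattern is small enough to reduce to a sketch. The names below mirror the article's description (customApiEndpoint, ANTHROPIC_BASE_URL), but this is an illustration of the boundary, not the code at those line numbers.

```swift
import SwiftUI
import Foundation

// Illustrative reduction of the pattern: one persisted setting,
// exported as an environment variable on a spawned bridge process.
struct EndpointSettings: View {
    @AppStorage("customApiEndpoint") var customApiEndpoint: String = ""
    var body: some View {
        TextField("Custom API Endpoint", text: $customApiEndpoint)
    }
}

func spawnBridge() throws {
    let endpoint = UserDefaults.standard.string(forKey: "customApiEndpoint") ?? ""
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/usr/bin/env")
    process.arguments = ["my-bridge"]            // hypothetical bridge binary
    var env = ProcessInfo.processInfo.environment
    if !endpoint.isEmpty {
        env["ANTHROPIC_BASE_URL"] = endpoint     // cloud API, proxy, or local llama.cpp shim
    }
    process.environment = env
    try process.run()
}
```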
Primitive 4 of 5
Skills are durable workflow units, not better prompts
The first time you ask a local agent to do something, you write a prompt. The second time, you copy the prompt. The third time, you should have a skill. A skill is a markdown file on disk at ~/.claude/skills/<name>/SKILL.md that the runtime loads when its trigger matches. It survives process restarts, lives outside any single chat, is editable with whatever editor you already use, and ships with the rest of your dotfiles.
Fazm bundles seventeen of these skills in Desktop/Sources/BundledSkills/ as .skill.md files (pdf, docx, xlsx, pptx, video-edit, frontend-design, deep-research, social-autoposter, telegram, web-scraping, and so on). On first run, SkillInstaller.swift copies them to ~/.claude/skills/ and keeps them in sync via SHA-256 checksum. Crucially, the Observer prompt at ChatPrompts.swift line 607 says new skills should only be created “when you detect a repeated workflow (3+ times).” That threshold is the literacy unit. A one-off task does not justify a skill. A thing you have done three times this week does.
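The install-and-sync idea fits in a few lines. This sketch assumes the check is a straight SHA-256 comparison of file contents; the real SkillInstaller.swift may differ in detail.

```swift
import Foundation
import CryptoKit

// Sketch of checksum-based skill sync: copy a bundled skill into
// ~/.claude/skills/ only when its SHA-256 differs from the installed copy.
// Paths follow the article's convention; this is not Fazm's actual code.
func sha256(of url: URL) throws -> String {
    let data = try Data(contentsOf: url)
    return SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
}

func installSkill(bundled: URL, name: String) throws {
    let fm = FileManager.default
    let skillDir = fm.homeDirectoryForCurrentUser
        .appendingPathComponent(".claude/skills/\(name)")
    let installed = skillDir.appendingPathComponent("SKILL.md")

    try fm.createDirectory(at: skillDir, withIntermediateDirectories: true)
    if fm.fileExists(atPath: installed.path),
       try sha256(of: installed) == sha256(of: bundled) {
        return  // already in sync, leave the installed copy alone
    }
    if fm.fileExists(atPath: installed.path) { try fm.removeItem(at: installed) }
    try fm.copyItem(at: bundled, to: installed)
}
```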
The threshold rule
A skill costs about ten minutes to write and zero ongoing attention to maintain. A repeated copy-pasted prompt costs nothing the first time and a small tax on every reuse. The break-even point sits exactly where Fazm puts it: three. If you have done a workflow three or more times, the prompt has graduated. Promote it to a markdown file under ~/.claude/skills/ and stop rewriting it.
Primitive 5 of 5
Memory is files, not magic
The prompt window is what the model sees on a single turn and disappears the moment the process exits. Memory is whatever survives. The simplest implementation, and the one Fazm uses, is a flat MEMORY.md plus per-topic markdown files in ~/.claude/projects/<workdir>/memory/. The runtime auto-loads MEMORY.md at session start, the model reads topic files when relevant, and a parallel Observer process decides what is worth saving as it watches the conversation.
This is mundane on purpose. The files are plain markdown. You can grep them, edit them in vim, diff them, check them into git if you want versioning, or rsync them between machines. There is no embedding model, no vector store, no service to keep alive. Once you understand that “long-term memory” for a local agent is just files the runtime reads at session start, the rest of the picture clarifies. Memory is how you stop re-explaining yourself to the model. It is not a database feature, it is a filesystem habit.
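Here is roughly what “auto-loads MEMORY.md at session start” amounts to, as a sketch that follows the directory convention above; function and variable names are illustrative.

```swift
import Foundation

// Sketch of session-start memory loading: read MEMORY.md into the prompt
// and list topic files so the model knows what it can pull in later.
func loadMemory(workdir: String) -> String {
    let fm = FileManager.default
    let memDir = fm.homeDirectoryForCurrentUser
        .appendingPathComponent(".claude/projects/\(workdir)/memory")
    let main = memDir.appendingPathComponent("MEMORY.md")

    var prompt = ""
    if let core = try? String(contentsOf: main, encoding: .utf8) {
        prompt += "## Memory\n\(core)\n"
    }
    let files = (try? fm.contentsOfDirectory(atPath: memDir.path)) ?? []
    let topics = files.filter { $0.hasSuffix(".md") && $0 != "MEMORY.md" }
    if !topics.isEmpty {
        // Only the index goes into the prompt; topic files are read on demand.
        prompt += "## Topic files available\n" + topics.joined(separator: "\n")
    }
    return prompt
}
```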
Self-check, are you fluent
One question per primitive. Five yes answers and you have it. Three or fewer and the next agent task you run will surprise you for reasons that will feel mysterious until you fill the gap.
The five-question literacy check
- I can sketch one full agent turn on paper, including what gets re-sent on every iteration unchanged.
- I know roughly what a screenshot of a window costs in input tokens versus an accessibility tree of the same window.
- I can name where the agent process gets its model from and how to point it at a different reasoner without rebuilding.
- I know the threshold at which a one-shot prompt should become a skill, and where that skill lives on disk.
- I can name what survives between sessions, where it lives, and how to edit it with a normal editor.
Where this came from
All five primitives are visible in the source of the Fazm desktop app under Desktop/Sources/. The agent loop and the bridge process live in Chat/ACPBridge.swift. The screen-state representation work is in the MacosUseSDK accessibility traversal code. The swappable reasoner is one @AppStorage at MainWindow/Pages/SettingsPage.swift line 885 wired into ANTHROPIC_BASE_URL at ACPBridge.swift lines 467 to 470. The seventeen bundled skills are in Desktop/Sources/BundledSkills/, installed by SkillInstaller.swift, and the Observer that decides when to create new ones is described in Chat/ChatPrompts.swift from line 581 onward. Memory routing lives in the same file. Open the repo, follow the references, and you can trace any one primitive end to end in under twenty minutes.
The Mac specifics will not all transfer to Linux or Windows; the framing will. AT-SPI on Linux and UI Automation on Windows give you the same accessibility-tree leverage. Skills as filesystem artifacts work anywhere. The ANTHROPIC_BASE_URL trick works anywhere. The agent loop is identical everywhere. The five primitives are a portable mental model that happens to be cheaply verifiable on a Mac because Fazm is open source.
Want to see the five primitives running on your machine
Twenty minutes, screen share, your repeating workflow. We point Fazm at your local model, write the first skill together, and watch the agent loop do something you have done by hand three times this week.
Questions people ask after reading this
What does 'local LLM workflow literacy' actually mean?
Fluency in the five operational primitives that sit between 'I have ollama running' and 'this thing is doing my recurring work.' Specifically: the agent loop (a model called repeatedly with tool results, not one prompt), the screen-state representation choice (accessibility tree vs screenshot), the swappable reasoner (knowing the agent process and the model are separate things), skills (durable instructions for repeated workflows, not better prompts), and persistent memory (files on disk the model reloads at session start). Get those five and the rest of the local-AI stack starts making sense as one system instead of a pile of unrelated tools.
Why is this not just 'prompt engineering' under a new name?
Prompt engineering optimises a single shot. Workflow literacy assumes the loop. The unit of thought is no longer 'what do I type into the box,' it is 'what does the system feed the model on every turn, what tools does it expose, and what state survives between turns.' A clever prompt cannot fix a system that re-sends 12,000 tokens of screenshot every iteration; only a representation choice can. A clever prompt cannot make a workflow stick across sessions; only a memory file or a skill can. Different vocabulary, different leverage points.
Where can I see the five primitives in real code?
Fazm is fully open source on GitHub at github.com/m13v/fazm and every primitive on this page maps to one or two files in the Desktop SPM package. The agent loop lives in Desktop/Sources/Chat/ACPBridge.swift around line 467. Screen-state representation lives in MacosUseSDK/AccessibilityTraversal.swift, where the cap constants are visible at the top of the file. The swappable reasoner is one @AppStorage line at SettingsPage.swift line 885 wired into ANTHROPIC_BASE_URL at ACPBridge.swift lines 467 to 470. Skills are 17 .skill.md files in Desktop/Sources/BundledSkills/, installed by SkillInstaller.swift. Memory is the MEMORY.md plus topic file convention referenced in ChatPrompts.swift line 590.
What is the screen-state representation choice and why does it matter so much?
Two ways to tell a model what is on screen. Way one: a PNG screenshot, which a vision model tokenises into roughly 1,500 to 3,000 input tokens per turn. Way two: a compact accessibility tree, where each interactive element becomes a tiny JSON record of role, text, and bounding box, which lands in the 200 to 400 token range for the same window. Across a 10-step task this is the difference between 25,000 and 150,000 input tokens. On a local 13B model this is the difference between two minutes and twelve minutes for the same job. Until you grok this, the model will feel slow and you will blame the model.
Why are skills the right unit for a repeated workflow, not better prompts?
A prompt lives for one session. A skill is a markdown file on disk under ~/.claude/skills/<name>/SKILL.md that the model reads at the start of any session where its name or trigger matches. The Observer prompt in Fazm at Desktop/Sources/Chat/ChatPrompts.swift line 607 specifies that a new skill is only created when a workflow has been observed three or more times, which is exactly the threshold where a one-shot prompt is wasted effort. Once a workflow is a skill, it is durable, versioned, editable like any other file, and shareable across machines.
Does any of this actually require a local model?
No, and that is the point of the swappable reasoner literacy. The agent process is one program; the model behind it is whatever speaks the right protocol. Fazm exposes a Custom API Endpoint setting at MainWindow/Pages/SettingsPage.swift line 885 that simply writes to a UserDefault called customApiEndpoint. At chat session start, ACPBridge.swift lines 467 to 470 read that value and set ANTHROPIC_BASE_URL on the spawned bridge process. Point that at a local Anthropic-compatible gateway in front of llama.cpp or vLLM and you are fully local. Point it at OpenRouter and you are not. Same agent, same skills, same memory, different reasoner. Treating that boundary as a UserDefault, not a recompile, is the literacy unit.
What is the difference between memory and the prompt window?
The prompt window is what the model sees on a single turn and disappears the moment the process exits. Memory is files on disk that get re-read at the start of every new session. The Fazm Observer at ChatPrompts.swift lines 588 to 590 routes interesting facts to a MEMORY.md plus per-topic markdown files in ~/.claude/projects/<workdir>/memory/. The model loads MEMORY.md at session start automatically and pulls topic files when relevant. This is mundane filesystem state, not a vector database; you can grep it, edit it in vim, and check it into git. Knowing that 'long-term memory' is just files is half the literacy.
How do I self-check whether I have this literacy?
Ask yourself five questions. Can I sketch the loop the agent runs on every turn (system prompt, tool schema, history, screen state, model emits a tool call, tool runs, repeat)? Can I quote roughly what a screenshot costs in tokens vs an accessibility tree of the same screen? Can I name where the agent process gets its model from and how to point it at a different one? Can I name the threshold at which a one-shot prompt should become a skill, and where that skill lives on disk? Can I describe what survives between sessions and where? Five yeses and you have it. Three or fewer and the next agent task you run will surprise you for reasons that will feel mysterious until you fill the gap.
Is this Mac-specific or does the framing apply on Linux and Windows?
The framing applies anywhere. The specific implementations are Mac-shaped because Fazm uses macOS accessibility APIs, AppleScript-style automation, and Swift, but the five primitives translate. On Linux the screen-state representation is the AT-SPI tree or its Wayland-side accessibility equivalent. On Windows it is UI Automation. The agent loop is identical. The swappable endpoint is identical. Skills are filesystem-based and work anywhere. Memory files are filesystem-based and work anywhere. The literacy is portable; the SDK names change.
Where do I start if I am at zero?
In order. One, run a local model with ollama or LM Studio so you have a chat that works offline. Two, point a computer-use agent at it and watch one task run; do not optimise yet. Three, count tokens for one turn so you feel the prefill cost; that is what makes the screen-state lesson stick. Four, find one workflow you have done three times this week; write it as a skill in ~/.claude/skills/. Five, write three things you want the agent to remember about you in a MEMORY.md and watch the next session pick them up. After step five you will have used all five primitives and the framing will be self-evident.
Deeper dives
Related, by primitive
Local LLM desktop agent throughput, the number that matters is not generation tok/s
Why prefill tokens per second on the screen-state input dominate decode speed once you move from chat to an agent loop on consumer Apple Silicon.
Personal AI agent on device, the way Fazm actually ships it on a Mac
The four-table SQLite schema and the one-line wrap in ChatProvider.swift that puts your local profile in front of every model turn.
Run vLLM locally on Mac and plug it into an AI agent that drives any Mac app
The single Settings field that rewrites ANTHROPIC_BASE_URL so a Metal-backed local server can drive Finder, Calendar, and any signed Mac app.