April 2026 LLM research updates, graded on the one question the roundups skip: what does your agent do with any of it?
GPT-6 on the fourteenth. Claude Opus 4 on the second. Gemini 2.5 Pro on the first. Llama 4 Scout with a ten-million token window. Qwen 3, Gemma 4, DeepSeek V3.2, GLM-5.1 all under open licenses. MIT's reasoning-model efficiency paper. HuggingFace's rebuilt leaderboard. Every top result for this keyword prints the same list. None describes the consumer-side pipeline that turns your reactions to those releases into persistent memory on your Mac. Fazm does, in three files, at exact line numbers.
Every release and paper this month, as the roundups print them
The rest of this page is about what one specific shipping Mac agent does while you scroll past them.
“Every April 2026 LLM research roundup is a static list. Fazm ships a third parallel Claude session that flushes a batch every 10 turn pairs, turning your own reactions to those releases into MEMORY.md topic files, typed cards, and, at three repeats, an executable skill.”
acp-bridge/src/index.ts line 1156 (CHAT_OBSERVER_BATCH_SIZE), Desktop/Sources/Providers/ChatProvider.swift line 1050 (observer session), Desktop/Sources/Chat/ChatPrompts.swift lines 556-597 and 582 (skill rule)
Reading a roundup vs running an observer
Same hour of April 2026 reading, two outcomes
You open 'Latest LLM Releases April 2026'. You read the GPT-6 section, the Opus 4 section, the Gemma 4 section, the MIT paper summary. You close the tab. Next week your agent has no idea that you already decided you evaluate new releases on math-heavy rubrics first, or that you prefer Apache 2.0 over proprietary weights for long sessions.
- Zero persistent state after you close the tab
- Zero effect on your agent's behavior next session
- Roundup is stale the day after the next release drops
- No automation drafted, no skills, no memory
What flows through the observer, visually
Sources on the left are the kinds of turns a user runs during a research-update session. The hub is the observer prompt at acp-bridge/src/index.ts line 1185. Destinations on the right are the four concrete places an observation can land.
Turn batch, observer, memory fanout
The exact threshold, in the exact file
This is the verbatim bridge code that decides when to flush the buffer. Line 1156 sets the constant. Line 1162 checks the threshold. Line 1185 is the batch prompt the observer receives, every word intact.
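If the embedded snippet does not load, the shape of that logic is easy to sketch. The following is illustrative only, not the verbatim bridge code: only the constant name, the divide-by-two threshold arithmetic, and the splice(0) flush come from this page; the function name and types are assumptions.

```typescript
// Illustrative sketch of the buffer-and-threshold shape described above.
// CHAT_OBSERVER_BATCH_SIZE, the floor(length / 2) check, and splice(0)
// come from the page; everything else is hypothetical.
const CHAT_OBSERVER_BATCH_SIZE = 10; // Send batch every N turn pairs

type Turn = { role: "user" | "assistant"; text: string };
const chatObserverBuffer: Turn[] = [];

function bufferChatObserverTurn(user: string, assistant: string): Turn[] | null {
  // Each turn pair appends two entries, so divide by two before comparing.
  chatObserverBuffer.push({ role: "user", text: user });
  chatObserverBuffer.push({ role: "assistant", text: assistant });
  if (Math.floor(chatObserverBuffer.length / 2) >= CHAT_OBSERVER_BATCH_SIZE) {
    // splice(0) returns all entries and empties the buffer in one operation.
    return chatObserverBuffer.splice(0);
  }
  return null; // still accumulating, no observer call yet
}
```

Nine calls return null; the tenth returns all twenty entries and leaves the buffer empty, which is why the walkthrough below talks about "pair 10" as the moment the observer wakes up.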
The entire observer persona, 42 lines
Loaded once per app launch at ChatProvider.swift line 1050, never rebuilt. Every rule in the observer's behavior, including the 3+ repeats threshold that turns your April 2026 research reactions into skills, is here.
Where the third session is born
The third entry in this array is the observer. No lazy spawn, no feature flag, no conditional, just a third init block on line 1050. If the app boots, the observer exists.
One full batch, turn by turn
Start at turn one, a casual 'what is new with GPT-6?' question. Ten turn pairs later the threshold trips and the observer runs. Here is every step the code takes between those two events.
From turn 1 to one new topic file in MEMORY.md
Turn 1, you ask the main chat about GPT-6
Your question and Claude's response are added to the observer buffer as two entries (role=user, role=assistant). The buffer is at 1/10 pairs. No observer call yet.
Turns 2 through 9, you compare Opus 4, Gemma 4, Llama 4 Scout
Each turn pair appends two entries. At 9/10 the observer is still quiet. The main chat is unaffected, no token cost, no UI pause.
Turn 10, threshold fires, buffer spliced atomically
Math.floor(20 / 2) === 10, so the condition at line 1162 is now true and flushChatObserverBatch() runs. Line 1183 does chatObserverBuffer.splice(0), which returns all 20 entries and empties the buffer in one operation.
Batch joined and wrapped in the verbatim prompt at line 1185
The 20 entries are mapped to '[role]: text' lines and joined with '\n\n'. That text is embedded in the fixed preamble at line 1185, which tells the observer to be conservative, check MEMORY.md first, and use save_observer_card.
Observer receives the batch, reads MEMORY.md
The observer's first tool call, per the system prompt at ChatPrompts.swift line 586, is always Read MEMORY.md. It checks the existing topic index before writing. This is how it avoids near-duplicate memories.
Observer writes one topic file and one index update
In a typical batch about April 2026 releases, the observer writes one new file like 'llm_release_benchmark_preference.md' (with frontmatter name, description, type), then does Edit on MEMORY.md to add a single one-line pointer.
save_observer_card fires, user sees a toast
The card lands in observer_activity with status 'pending'. The main chat's UI polls, marks it 'shown', and auto-accepts after a grace period. The user can deny to roll back the memory write.
Batch finishes, observer_poll signal sent to Swift
acp-bridge/src/index.ts lines 1224-1228 send { type: 'observer_poll', batchSize, batchTurnCount } to Swift, which queries observer_activity for new rows and surfaces them in the sidebar. Observer running flag flips back to false.
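That signal has a small, fixed shape. The sketch below models it in TypeScript; the field names type, batchSize, and batchTurnCount come from the page, while the interface name and the builder function are illustrative assumptions, not the shipping bridge code.

```typescript
// Hypothetical model of the observer_poll signal described above.
interface ObserverPollMessage {
  type: "observer_poll";
  batchSize: number;      // entries flushed (two per turn pair)
  batchTurnCount: number; // turn pairs in the batch
}

function makeObserverPoll(batch: unknown[]): ObserverPollMessage {
  return {
    type: "observer_poll",
    batchSize: batch.length,
    batchTurnCount: Math.floor(batch.length / 2),
  };
}
```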
The fourteen-message trace from question to toast
Fourteen internal messages across five actors. Three of them are tool calls the observer runs inside its own session. The last is the card the user actually sees.
Turn 1 'what is GPT-6' through to 'pattern' card toast
The four card types at ChatPrompts.swift:571
The enum is a single line of the system prompt. Four options. Three of them matter for consuming April 2026 research updates. Each is a different decision the observer is asked to make, not just a label.
insight (default)
A single observed fact. 'User evaluates new LLM releases on math-heavy rubrics first, benchmarks with AIME 2026 before trying SWE-Bench.' One memory file, one card.
pattern
A recurring behavior. 'User compares every April 2026 release head-to-head against Claude Opus 4 before adopting it.' Emitted when the observer sees the same shape 2-3 times.
skill_created
A 3+ repeat triggered the rule at ChatPrompts.swift line 582. The observer drafted a SKILL.md at ~/.claude/skills/{name}/ and emitted this card with body 'Created skill: {name}, {description}'.
summary
A periodic consolidation, used sparingly. One batch of twenty turns about GPT-6, Opus 4, Gemma 4, Llama 4 Scout, Qwen 3 might produce one summary card that folds five observations into one.
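The four values above, and the pending-then-shown lifecycle described elsewhere on this page, reduce to a small type sketch. The enum values and the 'pending' initial status come from the page; the TypeScript names are illustrative, since the shipping definition lives inside a Swift system prompt, not in code.

```typescript
// Illustrative model of the four-way card enum and a card's initial state.
type ObserverCardType = "insight" | "pattern" | "skill_created" | "summary";

interface ObserverCard {
  body: string; // conclusion, not narration: "Prefers X", not "I noticed X"
  type: ObserverCardType;
  status: "pending" | "shown" | "accepted" | "denied";
}

function saveObserverCard(body: string, type: ObserverCardType = "insight"): ObserverCard {
  // In Fazm this wraps an INSERT into observer_activity; here it only
  // models the 'pending' state the UI later advances to 'shown'.
  return { body, type, status: "pending" };
}
```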
What the dev log looks like during one batch
Every line here is a verbatim substring you can grep out of /tmp/fazm-dev.log after a research-heavy session. The 'buffered' format is printed by line 1161 of the bridge. The 'sending batch of N messages' line is from line 1213.
Ten independently verifiable claims about the observer
Every claim in this list can be confirmed by grepping the open-source Fazm tree at the file and line mentioned. None of them depends on a screenshot or a vendor claim.
Grep these lines if you want to verify the page
- Third session registered at ChatProvider.swift:1050, key='observer', model='claude-sonnet-4-6'
- Warmed once, in the same call as main and floating (lines 1047-1051)
- Batch size constant CHAT_OBSERVER_BATCH_SIZE = 10, acp-bridge/src/index.ts:1156
- Batch prompt at acp-bridge/src/index.ts:1185, fixed, includes 'conservative' rule
- System prompt at ChatPrompts.swift:556-597, 42 lines, loaded once per app launch
- Four card types listed at ChatPrompts.swift:571: insight, pattern, skill_created, summary
- 3+ repeat rule codified at ChatPrompts.swift:582, triggers SKILL.md draft
- MEMORY.md + topic files as primary store (ChatPrompts.swift:563-565)
- Observer runs under chatObserverRunning gate (line 1168), no overlapping batches
- Observer tool calls do not stream to UI (line 1197 comment in the bridge)
LLM research roundup vs running observer
What you get from reading a static April 2026 release list, against what the Fazm observer writes to disk while you read.
| Feature | Typical LLM research roundup | Fazm observer |
|---|---|---|
| What it is | A static blog post listing April 2026 LLM releases and papers | A running third Claude session, warmed once at ChatProvider.swift:1050 |
| Where it runs | A CMS | Your Mac, in a Node subprocess spawned by the Fazm app |
| Input | Model cards and benchmark tables | Your main chat's last 10 turn pairs, batched at acp-bridge/src/index.ts:1156 |
| Output | Prose you read once | MEMORY.md topic files + typed cards + skill drafts at ~/.claude/skills/{name}/SKILL.md |
| How it decides to save | Editor judgement at publish time | The 42-line prompt at ChatPrompts.swift:556-597, 'conservative, quality over quantity' |
| Turnover | Stale the day after a new release drops | Re-runs every 10 turn pairs, so new topics roll into memory as you react to them |
| Actionability | Zero, you read and close the tab | Skills auto-drafted after 3+ repeats, next session inherits them |
| Token cost to you | None, but zero learning | One observer prompt per 10 turn pairs, not per turn, flat with chat history size |
| Verifiability | Trust the editor | Every number on this page grep-able in the shipping open-source tree |
Eight April 2026 releases and papers, viewed from the observer's seat
Each of these is in every LLM-research-updates roundup. What changes on this page is that none of them are the observer's input. Your reactions to them are. The observer sees text-only turn pairs, batched 10 at a time, regardless of whether the release was multimodal, 10M context, or MIT-licensed.
GPT-6, April 14
HumanEval past 95%, MATH around 85%, 40% over GPT-5.4 on coding/reasoning/agents. In the observer, one reaction turn pair like 'I want to always test new releases on AIME 2026 first' is the kind of line that becomes a pattern card on pair 20.
Claude Opus 4 + Haiku 4.5
Opus 4 at 200K with 72.1% on SWE-Bench Verified. Fazm's observer session runs on claude-sonnet-4-6, the same model family as the main chat, though the 42-line observer prompt would fit comfortably under Haiku 4.5's context budget too.
Gemini 2.5 Pro
Single-prompt multimodal reasoning over video, image, audio, text. The observer never sees any of that, its input is plain-text turn pairs, but your reactions to a release like this are exactly what it turns into memory.
Llama 4 Scout, 10M context
Largest window of the month. Interesting for compaction, irrelevant to the observer, which sees at most 20 turn pairs per batch, about 10K-60K tokens of turn text.
Qwen 3, Apache 2.0
Dense and MoE, 128K window. If you test it through an Anthropic-compatible proxy in Fazm's settings, the observer runs the same batch loop regardless of the underlying model.
Gemma 4 at 131K
Four variants under Apache 2.0. Observer cadence is model-agnostic, the batch size constant is set in TypeScript, not in the model.
DeepSeek V3.2, 128K
Frontier reasoning, 128K. Long chains of thought over a research comparison are exactly the kind of turn pairs that make 'conservative' the right observer rule, not 'summarize every line'.
MIT smaller-model predictor paper
A smaller model predicts the outputs of the larger reasoning LLM to cut training work. The kind of finding that, if you cite it twice in a conversation, becomes a pattern card, and if you cite it three times, becomes a skill.
Watch the observer run while we discuss April 2026 releases
Twenty minutes on a screen share. We ask the Fazm main chat about GPT-6, Opus 4, Gemma 4, and the MIT paper, hit turn ten, and watch the observer write one new file to MEMORY.md live.
Book a call →
FAQ, for the April 2026 LLM research updates angle no roundup covers
What are the headline LLM research updates from April 2026?
The roundup-style shortlist: GPT-6 launched globally April 14 (HumanEval past 95%, MATH reasoning around 85%, 40% improvement over GPT-5.4 on coding and reasoning), Anthropic's Claude 4 family on April 2 with Opus 4 scoring 72.1% on SWE-Bench Verified and a 200K context window, Google Gemini 2.5 Pro on April 1 with single-prompt multimodal reasoning over video / image / audio / text, Meta Llama 4 family on April 5 (Scout with a 10M token context window, Maverick at 400B total), Alibaba Qwen 3 dense and MoE under Apache 2.0, Google Gemma 4 at 131K, Mistral Medium 3, DeepSeek V3.2 at 128K, Zhipu GLM-5.1 MIT-licensed 754B MoE at 200K. On the research side: MIT published a method where a smaller model predicts the outputs of a larger reasoning model during training to cut compute, and HuggingFace revamped the Open LLM Leaderboard with IFEval-Hard, MATH-Verify, LiveCodeBench-2026, and a new multi-turn conversation benchmark. Every top SERP result lists these. None describes what a consumer Mac agent should do with any of it.
What is Fazm's 'Chat Observer' and why is it the angle?
The Chat Observer is a third Claude session that runs in parallel with the main chat and the floating-bar chat. It is registered at Desktop/Sources/Providers/ChatProvider.swift line 1050 inside the same acpBridge.warmupSession call that spins up 'main' (line 1048) and 'floating' (line 1049). The session uses claude-sonnet-4-6 and its systemPrompt argument is chatObserverSystemPrompt, built by ChatPromptBuilder.buildChatObserverSession at lines 1043-1046. The observer does not emit UI text. Its only outputs are memory-file edits (via the SDK's built-in memory tools) and save_observer_card calls that render as auto-accepted, user-denyable toasts. That is the consumer-side lever no LLM-research-updates roundup covers.
How often does the observer run?
Every 10 main-session turn pairs. The constant CHAT_OBSERVER_BATCH_SIZE = 10 lives at acp-bridge/src/index.ts line 1156 with the inline comment 'Send batch every N turn pairs'. bufferChatObserverTurn at lines 1158-1165 counts turns, and once Math.floor(buffer.length / 2) hits 10 it calls flushChatObserverBatch. Inside flushChatObserverBatch (lines 1170-1241) the buffer is atomically spliced out (line 1183), joined as '[role]: text\n\n[role]: text…' (line 1184), and shipped as one prompt to the observer session via acpRequest('session/prompt', {sessionId, prompt: [{type:'text', text: prompt}]}) (lines 1215-1218). Twenty turns of your reactions to GPT-6, Opus 4, Gemma 4 and the MIT efficiency paper therefore hit the observer as one batch and become one consolidated memory update, not twenty noisy writes.
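The join format that answer quotes can be sketched as a tiny helper. Illustrative only: the '[role]: text' line shape and the '\n\n' separator come from the page; the function and type names are assumptions.

```typescript
// Hypothetical serializer matching the '[role]: text\n\n[role]: text' format
// the page attributes to line 1184 of the bridge.
type ObserverTurn = { role: "user" | "assistant"; text: string };

function serializeBatch(batch: ObserverTurn[]): string {
  return batch.map((t) => `[${t.role}]: ${t.text}`).join("\n\n");
}
```

The resulting string is what gets embedded in the fixed preamble and shipped as one session/prompt call.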
What is the verbatim prompt the bridge sends the observer?
At acp-bridge/src/index.ts line 1185, verbatim: 'Here are the latest conversation turns from the main session:\n\n${batchText}\n\nAnalyze these turns. Be conservative, only save things that are genuinely significant and useful for future conversations. Skip routine queries, transient context, and near-duplicates of things already saved. Each observation in this batch must cover a distinct topic, no overlapping or closely related saves. Read MEMORY.md first to check what's already known, then use your file tools (Read, Write, Edit) to save new memories as individual topic files and update MEMORY.md. Use save_observer_card to surface important observations to the user. If you detect a repeated workflow (3+ times), draft a skill.' The word 'conservative' is load-bearing. The observer's job is explicitly not to summarize the firehose, it is to save the 1-2 signals per batch that will be useful three months from now.
What are the four card types the observer can emit?
Defined in ChatPrompts.swift line 571: insight (default), pattern, skill_created, summary. 'insight' is for a single observed fact, like 'user prefers Gemma 4 for short reasoning tasks because of latency'. 'pattern' is for a recurring behavior, like 'user compares every new April 2026 release head-to-head against Claude Opus 4 before trying it'. 'skill_created' fires when the observer has drafted a reusable workflow (see the next question). 'summary' is a periodic consolidation. Each card is written via save_observer_card(body, type) and lands in the observer_activity table with status 'pending' before the UI marks it 'shown' and auto-accepts it. The user can deny a card to roll back the underlying memory write. Line 572 says explicitly: 'NEVER write raw INSERT SQL to observer_activity, always use this tool.'
What triggers the 'skill_created' type?
The rule at ChatPrompts.swift line 582: 'Skills, when you detect a repeated workflow (3+ times), create a skill at ~/.claude/skills/{skill-name}/SKILL.md. Check existing skills first. After creating: save_observer_card(body: "Created skill: {name}, {description}", type: "skill_created").' In the April 2026 research-update case this means: if the observer watches you compare new releases three times, on the third it drafts a skill called something like 'compare-llm-release' with a SKILL.md that captures the steps it saw you run. The next time a model drops in May, that skill is already on disk and executable. This is how the observer converts your reactions to the April 2026 release firehose into reusable, local automation, not into a static reading list.
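The 3+ repeat rule is a simple counter at heart. The sketch below is purely illustrative, an assumption about shape, not the observer's actual implementation (the real rule lives in a prompt, and the observer judges "same workflow" semantically, not by exact name):

```typescript
// Hypothetical counter illustrating the 3+ repeat threshold described above.
const workflowSightings = new Map<string, number>();

function recordWorkflow(name: string): "noted" | "draft_skill" {
  const seen = (workflowSightings.get(name) ?? 0) + 1;
  workflowSightings.set(name, seen);
  // On the third sighting the observer drafts a SKILL.md; before that it
  // only notes the pattern.
  return seen >= 3 ? "draft_skill" : "noted";
}
```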
What is save_observer_card's signature?
From ChatPrompts.swift line 570, verbatim: 'save_observer_card(body: "Saved: user prefers dark mode", type: "insight")'. Body is freeform prose (the conclusion, not the narration), type is one of the four enum values. The tool wraps an INSERT into observer_activity with status 'pending', type set from the enum, and a timestamp. The observer prompt is explicit at line 594: 'One memory + one card per observation. Conclusions not narration: "Prefers X" not "I noticed X".'
How many memory writes per batch, typically?
The prompt at ChatPrompts.swift lines 588-596 is explicit: 'Quality over quantity. Only save things genuinely useful for future conversations. Do NOT save: routine queries, things already handled, temporary debug context, session-only info. DO save: personal preferences, recurring patterns, important relationships, life events, professional context, communication style. Always check MEMORY.md first, skip near-duplicates. One memory + one card per observation. Conclusions not narration.' In practice a batch of 10 turn pairs about April 2026 releases will result in zero, one, or two memory writes. The observer is designed to stay quiet. A GPT-6 reaction that is a routine question ('what is GPT-6') produces nothing. A reaction that reveals a preference ('I always want to benchmark new releases on a math-heavy rubric') produces exactly one memory file and one insight card.
Where does the observer's memory actually live?
The observer uses the Claude Agent SDK's built-in persistent memory at MEMORY.md + topic files, not a separate database. Lines 563-565 of the system prompt say: 'You have a built-in persistent memory system (MEMORY.md + topic files). Use it directly, this is your primary tool. Read MEMORY.md first to check what's already known, then use your file tools (Read, Write, Edit) to save new memories and update the index. Follow the built-in memory format and rules exactly as documented in your system context.' Observer activity cards still land in the observer_activity SQLite table (line 608 of ChatPrompts.swift tableAnnotations, visible to the main chat via execute_sql), but the actual memory substrate is the file system. That is why memory survives across app versions and can be grepped like any markdown tree.
How does the observer coexist with the main chat without racing?
Three guards. (1) The observer is warmed once, alongside main and floating, in a single acpBridge.warmupSession call at ChatProvider.swift lines 1047-1051. There are no lazy spawns. (2) chatObserverRunning (acp-bridge/src/index.ts line 1168) is a boolean gate: lines 1172-1175 short-circuit if a batch is already running, and a setTimeout at line 1238 retries in 1 second once the previous batch finishes. (3) Observer notifications register a per-session handler (line 1192), so observer tool calls never bleed into the main chat's UI stream. The main chat and floating chat see zero text from the observer by design; the bridge comment at line 1197 is 'Log chat observer tool calls for debugging but don't send to Swift UI'.
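Guard (2) is a classic single-flight gate. The sketch below shows the shape, under stated assumptions: the flag name, the short-circuit, the 1-second retry, and the flag reset come from this page; the function signature and the injected retry callback are illustrative, not the shipping bridge code.

```typescript
// Illustrative single-flight gate: one observer batch at a time,
// with a scheduled retry when a flush arrives mid-batch.
let chatObserverRunning = false;

async function runObserverBatch(
  work: () => Promise<void>,
  scheduleRetry: (delayMs: number) => void,
): Promise<boolean> {
  if (chatObserverRunning) {
    scheduleRetry(1000); // try again in 1 second, as the page describes
    return false;
  }
  chatObserverRunning = true;
  try {
    await work();
  } finally {
    chatObserverRunning = false; // flag flips back even if the batch throws
  }
  return true;
}
```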
What does the observer do that stock memory features do not?
Three things. First, it is batched at a fixed cadence (10 turn pairs), not per-turn. That cadence is what makes 'conclusions not narration' possible, the observer sees a topic play out across turns before it decides to save. Second, it is scoped to a dedicated session whose system prompt is 42 lines long and is loaded once per app launch, so token cost stays flat as your chat history grows. Third, it has the skill-drafting rule (3+ repeats), which turns the memory layer into an executable workflow layer, not just a retrieval layer. None of those three behaviors is present in Claude.ai, Claude Desktop, or generic 'memory' features in other agents as of April 2026.
How do I verify every fact on this page?
Clone github.com/mediar-ai/fazm, then:
- grep -n 'CHAT_OBSERVER_BATCH_SIZE' acp-bridge/src/index.ts (expect a hit at line 1156)
- grep -n 'chatObserverSession' Desktop/Sources/Chat/ChatPrompts.swift (expect the const at line 556, the Types line at 571, the skill rule at line 582)
- grep -n 'key: "observer"' Desktop/Sources/Providers/ChatProvider.swift (expect line 1050)
- count the chatObserverSession constant: lines 556 through 597 inclusive is 42 lines, including the closing triple-quote
- grep -n 'save_observer_card' Desktop/Sources/Chat/ChatPrompts.swift to see both the tool definition and the four enum values
Every claim on this page is pinned to a line number in a file that ships inside the signed Fazm DMG.
Adjacent shipping behaviors
Keep reading
Large language model research updates, April 2026: how a shipping Mac agent compacts its context
The compaction side of the same shipping agent. Where this page is about the observer, that one is about what happens when the main session's context window fills.
Open-source LLM release news, April 2026: what happens inside a shipping Mac agent when the 128K window fills
The 268-line patched ACP entry point that forwards compact_boundary, rate_limit_event, and six more classes the stock agent drops.
AI tech news last 24 hours: the observer's 42-line system prompt in detail
A deeper walk through ChatPrompts.swift lines 556-597 and how the four card types route different kinds of observations.