Local AI endpoint model detection: the middle layer everyone skips

Pointing an app at a local AI server is one settings field. Knowing whether a model is actually loaded behind that URL is a different problem, and almost no consumer client gets it right. This is the two-pattern explanation, plus the exact code seam in one open-source macOS agent that does the unglamorous middle work.

Matthew Diakonov
9 min read

Direct answer (verified 2026-05-13)

Two patterns are used in practice. Probe-first: the calling app sends GET /v1/models to the local server (LM Studio default http://localhost:1234, Ollama default http://localhost:11434), parses the data[] array, and treats an empty array as “no model loaded”. Error-translate: the app skips the probe, sends the real request, catches the upstream 400, matches on known strings like no models loaded, lms load, or econnrefused, and translates that into an actionable message. Fazm uses pattern two; the file is Desktop/Sources/Chat/ACPBridge.swift around line 2155.

Sources: LM Studio /v1/models docs, Fazm CHANGELOG.json.

The thesis: a URL is not a model

Every walkthrough that ends with “just paste http://localhost:1234 into the base URL field” is telling you half the truth. The other half is that the server at that URL can be in five different states, and the difference between them is invisible until you send an actual request. The server can be down. The server can be up with no model loaded. The server can be up with a model loaded but not the one the request asks for. The server can be up with the right model but with a wrong context length. The server can be up, with the right model, and finally answer.

A consumer app that wraps this without a detection layer treats every one of those failures as the same vague upstream error. The user sees API Error: 400 and blames the app. That is the engineering problem the rest of this page is about.


if lower.contains("no models loaded") || lower.contains("lms load") { return "Your custom API endpoint reported no model is loaded..." }

Desktop/Sources/Chat/ACPBridge.swift, line 2155 in the public fazm repo

Pattern one: probe /v1/models, then send

The clean version. Before any real request, the calling app issues a GET /v1/models to the configured endpoint. The OpenAI-compatible surface returns a payload of the shape { object: 'list', data: [{ id, object: 'model', ... }] }. An empty data array means the server is up but holds no weights. A non-empty array gives the app a list of model ids it can show in a dropdown, validate against, or pass through as the model field on the next chat request.
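For concreteness, here is a minimal Swift sketch of that probe, assuming an OpenAI-compatible server. All type and function names are illustrative, not from Fazm (which, as described below, skips the probe entirely):

import Foundation

// Minimal probe sketch. Decodes only the fields the detection layer needs.
struct ModelList: Decodable {
    struct Model: Decodable { let id: String }
    let data: [Model]
}

enum ProbeResult {
    case serverDown          // connect failure, timeout, or unparseable body
    case noModelLoaded       // 200 with an empty data[]
    case loaded([String])    // model ids usable as the next request's model field
}

func probeModels(baseURL: URL) async -> ProbeResult {
    var request = URLRequest(url: baseURL.appendingPathComponent("v1/models"))
    request.timeoutInterval = 3  // a probe should fail fast
    do {
        let (body, _) = try await URLSession.shared.data(for: request)
        let list = try JSONDecoder().decode(ModelList.self, from: body)
        return list.data.isEmpty ? .noModelLoaded : .loaded(list.data.map(\.id))
    } catch {
        return .serverDown
    }
}

// Usage: let result = await probeModels(baseURL: URL(string: "http://localhost:1234")!)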

LM Studio, Ollama, vLLM, and mlx-omni-server all expose this endpoint under the OpenAI-compat surface; LM Studio also publishes a GET /api/v0/models variant that returns extra fields (state, quantization, arch) you can surface back to the user. If you are writing a new client today and you care about the first-time-setup experience, this is the right pattern.
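If you target LM Studio's richer GET /api/v0/models variant, the extra fields decode the same way. A sketch modeling only the id plus the fields named above:

import Foundation

// Sketch for LM Studio's /api/v0/models listing; only id plus the extra
// fields mentioned above (state, quantization, arch) are modeled here.
struct LMStudioModel: Decodable {
    let id: String
    let state: String        // loaded vs. not, per the /api/v0 docs
    let quantization: String?
    let arch: String?
}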

The catch is that LM Studio with Just-in-Time Model Loading turned on returns every model you have ever downloaded in /v1/models, regardless of whether the model is in memory. A successful probe is not a guarantee that the next real request will complete inside any particular latency budget; the first inference call against a cold-listed model still has to load the weights, and that can take tens of seconds on a 30B-class quant. So even with the probe in place, the calling app still needs the second pattern to handle the cold-load case gracefully.

Pattern two: parse the error from the first real request

The pragmatic version. Skip the probe, send the request the user actually wanted to send, and only run translation logic when the request fails. The detection layer becomes part of the error path instead of part of the happy path. No extra round trips when things work, full diagnostic context when things break. The trade is that the first failure is the only signal, so the translator has to be specific enough to actually be useful.

That is the pattern Fazm uses. The toggle in Settings → Advanced → AI Chat → Custom API Endpoint writes a string into customApiEndpoint on UserDefaults. When the bridge subprocess starts up, Desktop/Sources/Chat/ACPBridge.swift lines 516-519 read that field and export it as ANTHROPIC_BASE_URL in the bridge environment. From there the bridge speaks the Anthropic Messages format to whatever URL the user typed in. No probe runs.
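The repo lines themselves are not reproduced here, but a minimal sketch of that seam, assuming only the UserDefaults key and env-var name given above, looks something like:

import Foundation

// Sketch of the bridge-spawn seam. The UserDefaults key and env var match
// the article; the function name and everything else are illustrative.
func makeBridgeEnvironment() -> [String: String] {
    var env = ProcessInfo.processInfo.environment
    if let endpoint = UserDefaults.standard.string(forKey: "customApiEndpoint"),
       !endpoint.isEmpty {
        // The bridge will speak the Anthropic Messages format to this URL.
        env["ANTHROPIC_BASE_URL"] = endpoint
    }
    return env
}

let bridge = Process()
bridge.environment = makeBridgeEnvironment()
// ...set executableURL and arguments, then bridge.run()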

When a real chat request fails, the bridge surfaces the upstream error message through its agentError variant. The translation layer at ACPBridge.swift lines 2148-2161 reads the same customApiEndpoint default, lowercases the upstream error, and matches against five substrings. If any of them hit, the user-facing message is replaced with a string that names the endpoint URL and the exact menu path to fix the problem.

Raw vs. translated

What the agent surface shows by default (the literal upstream body from LM Studio's HTTP layer):

API Error: 400
{
  "error": "'messages' array must only contain objects with a role and content field"
}

(internal cause: No models loaded.
 Use the 'lms load' command before sending requests.)

What the translation layer shows instead (ACPBridge.swift line 2155): "Your custom API endpoint reported no model is loaded..."

What gets matched on, and what gets missed

The five strings Fazm matches against are not a complete catalog of local-server failure modes. They are the ones that triggered real bug reports painful enough to warrant a shipped fix in v2.6.7. The shape of the file makes the gap visible: the function falls through to return cleaned on any unmatched error, so adding a new server's vocabulary is a one-line edit, not a refactor. The current substrings target LM Studio (“no models loaded”, “lms load”, “api error”) and the POSIX/Node connect-failure surface (“connection refused”, “econnrefused”).
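A condensed sketch of that shape (the substrings match the shipped set, but the replacement wording below is illustrative; the real function lives at ACPBridge.swift:2148-2161):

import Foundation

// Condensed sketch of the translator. Substrings match the shipped set;
// the replacement wording is illustrative, not the shipped copy.
func translate(_ upstream: String, endpoint: String?) -> String {
    let cleaned = upstream.trimmingCharacters(in: .whitespacesAndNewlines)
    guard let endpoint, !endpoint.isEmpty else { return cleaned }
    let lower = cleaned.lowercased()
    if lower.contains("no models loaded") || lower.contains("lms load") {
        return "Your custom API endpoint (\(endpoint)) reported no model is loaded. Load one in LM Studio, then retry (Settings → Advanced → AI Chat)."
    }
    if lower.contains("connection refused") || lower.contains("econnrefused") {
        return "Nothing is listening at \(endpoint). Start your local server, or fix the URL in Settings → Advanced → AI Chat."
    }
    if lower.contains("api error") {
        return "Your custom API endpoint (\(endpoint)) returned an error: \(cleaned)"
    }
    return cleaned  // fall-through: unknown vocabulary passes through untouched
}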

Ollama lives in the same gap but speaks a different dialect. When you ask Ollama to serve a model that has not been pulled, it returns model "foo" not found, try pulling it first. The current Fazm substrings would not catch that. The right patch is two lines (sketched below), but the broader point is that a substring-on-error-body design has to be maintained against the vocabulary of every supported server, version, and locale. Probe-first does not have that maintenance burden, which is the strongest argument for adopting it as a complement, not a replacement.
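In the sketch above, that patch is one more branch before the fall-through (hypothetical; this string and wording are not in the shipped file):

// Hypothetical extension for Ollama's dialect; not in the shipped file.
if lower.contains("not found, try pulling it first") {
    return "Your custom API endpoint (\(endpoint)) has no such model. Run 'ollama pull <model>' first."
}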

5 error substrings matched
2 files touched (bridge + settings)
1 UserDefault key (customApiEndpoint)
1 env var (ANTHROPIC_BASE_URL)

Counterargument: just use the raw error

The strongest objection to writing any translation layer at all is that you are now responsible for keeping the translation in sync with upstream. LM Studio changes the error body shape between 0.3.x releases. Ollama renames endpoints between minor versions. vLLM ships new server modes that produce different error JSON. Every translation is one more thing that can rot, and a mis-translated error is arguably worse than the raw one because it confidently points the user at the wrong fix.

This is a real cost. The defense is the fall-through. Fazm’s translation layer only replaces the error when a known substring hits; everything else passes through cleaned but otherwise untouched. So the worst-case cost of a vocabulary drift is that the user sees the raw error, which is the baseline anyway. The best-case benefit is that the most common failure mode (LM Studio with no model loaded) gets a one-click resolution. The asymmetry is favorable, and the code is short enough that it is easy to audit when upstream release notes call out an error-body change.

The exact code seam, for anyone copying the pattern

If you are building a similar client, the file map below is the minimum surface area. Two files, one UserDefault, one env var, five strings, one fall-through. Less code than most logging wrappers.

Bridge spawn (env-var export):
  Desktop/Sources/Chat/ACPBridge.swift:516-519

Error-translation layer:
  Desktop/Sources/Chat/ACPBridge.swift:2148-2161

Settings UI (toggle + textfield):
  Desktop/Sources/MainWindow/Pages/SettingsPage.swift:955-1000

Change-trigger (bridge restart on save):
  ChatProvider.restartBridgeForEndpointChange()

Ship vehicle:
  CHANGELOG.json line 162 (version 2.6.7,
  "Friendlier error when a Custom API Endpoint
   (e.g. LM Studio) returns no models loaded
   or refuses connection")

"No models loaded. Use the 'lms load' command before sending requests."
(LM Studio HTTP layer; literal upstream error body, copied from a real 400 response)

Where this leaves the calling app

Detection is the first layer of a longer chain. The detection layer answers “is there a model behind this URL?”. The next layer answers “which one?”, which a probe-first design surfaces for free and an error-translate design has to fetch separately. The layer after that answers “is the loaded model strong enough to actually run the agent loop the user is about to start?”, which is the question the MLX-on-desktop write-up answers in more detail. None of those layers are free, and a consumer app that wants to feel like it understands the user’s local stack has to build all three.

What ships in Fazm today is the first layer plus the user-visible recovery prompt that turns a generic 400 into a specific next click. The other two are open work. The advantage of being open source is you can read the seam and decide whether to extend it.

Building a client that talks to a local model server?

If you are wiring an app to LM Studio, Ollama, or vLLM and the detection layer is the part you are stuck on, we have shipped this end to end on macOS. Happy to compare notes.

Frequently asked questions

What does 'local AI endpoint model detection' actually mean?

It means knowing, at request time, which model (if any) is loaded and addressable behind a local HTTP endpoint like http://localhost:1234 (LM Studio default) or http://localhost:11434 (Ollama default). The endpoint being reachable is not the same as a model being ready. Most consumer apps that let you 'point at a local server' check that the URL is parseable, set a base URL, and then send the first real request. If the server is up but no model is loaded, the request fails with a generic 400 the user has no way to interpret. Detection is the small layer that fills that gap, either by probing the /v1/models endpoint before the first real request or by parsing the upstream error after it.

What is the actual one-line answer to the underlying question?

Send GET /v1/models to the endpoint. LM Studio, Ollama, vLLM, mlx-omni-server, and claude-code-mlx-proxy all expose it under the OpenAI-compatible surface. The response shape is { object: 'list', data: [{ id, object: 'model', ... }] }. If data is empty, the server is up but no model is loaded. If the connection is refused or the request times out, the server is not up. If you get back a non-empty data array, that array is the list of model ids you can pass as 'model' in your next request. Fazm does not poll this endpoint; it parses the error from the first real request instead. Both approaches work. The choice is a tradeoff (see below).

Where exactly is this implemented in Fazm?

Two files. Desktop/Sources/Chat/ACPBridge.swift lines 516-519 read the customApiEndpoint UserDefault and export it to the bridge subprocess as the ANTHROPIC_BASE_URL env var. Desktop/Sources/Chat/ACPBridge.swift lines 2148-2161 sit in the BridgeError.localizedDescription path and inspect every agentError message. When the customApiEndpoint UserDefault is non-empty and the lowercased error contains 'no models loaded' or 'lms load', the function returns a string naming the endpoint URL and pointing the user at Settings → Advanced → AI Chat. The toggle in Settings is in Desktop/Sources/MainWindow/Pages/SettingsPage.swift lines 955-1000 under the settingId 'aichat.endpoint'. The whole feature shipped in v2.6.7 according to CHANGELOG.json line 162.

Why not just poll /v1/models before every request?

Two reasons. First, it doubles the round-trip count for every chat turn, and a local agent loop already sends a request every few seconds during a multi-step task. Second, it tells you nothing about whether the loaded model will actually respond to your specific request. LM Studio with Just-in-Time Model Loading enabled returns every downloaded model in /v1/models, but a model is not in memory until the first inference request warms it. So a probe that says 'a model is available' can still fail on the real request with a cold-load timeout. Error-translate avoids the optimistic-probe trap by reading the failure the actual request produced.

Why not just show the raw error?

Because the raw error blames the wrong system. The literal upstream string from LM Studio is API Error: 400 ... "No models loaded. Use the 'lms load' command before sending requests." To a user who just installed a macOS agent and typed one message, that error has no visible connection to LM Studio. They blame the agent. The agent gets a one-star review. This is not theoretical: it is the reason CHANGELOG.json line 162 exists. The actionable replacement names the endpoint, names the local-server vendor, and gives the exact menu path to fix it. The detection layer is the translator.

Which error strings does Fazm match on?

Five, all lowercased before matching. 'no models loaded' (LM Studio's literal error body), 'lms load' (the LM Studio CLI verb the upstream error tells you to run), 'api error' (the prefix on every LM Studio HTTP-layer error), 'connection refused' (POSIX-flavored error from anything bound to a closed port), and 'econnrefused' (Node-style equivalent, since the bridge runs as a Node child process). Match-on-substring is a lossy heuristic, so the file is structured to fall through to the raw cleaned error if none of the strings match. The strings live at Desktop/Sources/Chat/ACPBridge.swift lines 2155 and 2158.

Does this work with Ollama, vLLM, mlx-omni-server, or only LM Studio?

The actionable message is LM-Studio-shaped (it tells you to use the Developer → Load Model menu), but the underlying mechanism is endpoint-agnostic. As long as the local server returns an error body containing one of the five strings, Fazm will catch it. Ollama returns 'model "foo" not found, try pulling it first', which does not match. vLLM returns its own 'no model loaded' variants in some configurations and 'econnrefused' if the server is not bound. The right next step in a real product is to broaden the string set per server and per version. The mechanism is right. The vocabulary is incomplete.

Does Fazm support pointing at non-Anthropic endpoints in general?

Yes. The customApiEndpoint field is a free-text URL. Behind the scenes it is exported to the bridge as ANTHROPIC_BASE_URL, so the agent loop still speaks the Anthropic Messages API. To talk to OpenAI-shaped local servers (LM Studio, Ollama, mlx-omni-server) you need a small proxy in front of them that translates Anthropic Messages to OpenAI Chat Completions. vllm-mlx and claude-code-mlx-proxy speak /v1/messages directly. The detection layer is the same regardless of which proxy you pick: the bridge catches the error string, the user sees an actionable message.
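A toy sketch of the request-body mapping such a proxy performs, assuming text-only messages (real Anthropic content blocks are arrays, and streaming and tools are omitted):

import Foundation

// Toy Anthropic Messages → OpenAI Chat Completions body mapping.
// Text-only content; block arrays, tools, and streaming are omitted.
struct AnthropicRequest: Decodable {
    struct Message: Decodable { let role: String; let content: String }
    let model: String
    let system: String?      // Anthropic carries the system prompt top-level
    let messages: [Message]
    let max_tokens: Int
}

func toOpenAIBody(_ req: AnthropicRequest) -> [String: Any] {
    var messages: [[String: String]] = []
    if let system = req.system {
        messages.append(["role": "system", "content": system])  // becomes a system message
    }
    messages += req.messages.map { ["role": $0.role, "content": $0.content] }
    return ["model": req.model, "messages": messages, "max_tokens": req.max_tokens]
}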

Is the detection good enough to ship, or is it a starting point?

Honest answer: starting point. The current code does not distinguish 'server is unreachable' (connection refused) from 'server is up but no model loaded' (the 400 body) in the user-facing copy beyond a one-line branch. It does not call /v1/models proactively even once on save, which would catch the 'misconfigured URL' case earlier. It does not surface the model id back to the chat header so the user can see which weights they are actually talking to. All three are obvious next steps. The point of writing this up is to flag that the gap between 'I set a URL' and 'I know what model is behind it' is bigger than one error message, and the layer that fills it is worth building deliberately.
