When Your AI Coding Tool Gets Worse: Reliability, Trust, and Redundancy
Every few months a chunk of AI coding Twitter lights up with the same thread. "Is it just me or is this tool way worse this week?" Half the replies agree. The vendor posts a short message about context management optimizations or capacity constraints. The quality quietly recovers a few days later. No one learns anything actionable from it. This guide is about how to evaluate AI coding tool reliability honestly, how to handle the communication gap from vendors, and how to build a workflow that still works when one of your tools has a bad week.
1. Why "It Feels Worse" Is Hard to Verify
The human ability to judge the quality of an AI coding tool from day to day is pretty terrible. You remember the frustrating runs more vividly than the smooth ones. You compare yesterday's warm-up prompt against today's complex task and conclude the model is dumber. A new engineer joins the team and files wild bug reports because they are using the tool differently than you do. None of this means nothing is happening on the vendor side. It means that without data you cannot tell the difference between a real regression and a vibes regression.
Meanwhile, some regressions are real. Context window reductions ship without a changelog. Routing changes can quietly move your requests to a cheaper, faster, less capable variant. Rate limits get tightened. Tool use schemas change. System prompts are updated. Vendors ship and roll back these changes constantly because the product is still being built in public.
The uncomfortable truth is that the only way to know if a tool is actually worse is to measure it against a stable baseline you control. Twitter consensus is useful as a heads-up that something might be up. It is not a substitute for running the same prompt with the same inputs and comparing today's output to last week's.
2. The Shape of Silent Regressions
Silent regressions in AI coding tools tend to take a few recognizable shapes. Recognizing them is the first step toward diagnosing them.
Context compression kicks in earlier. You notice the model "forgetting" things you mentioned twenty messages ago that it used to remember for a hundred. This usually means the vendor tightened the summary threshold for cost reasons. Your total context window looks the same but the active retention is shorter.
Tool use gets flakier. The model starts producing malformed tool calls, forgets to use tools that used to be obvious choices, or burns steps on unnecessary reads. This often correlates with system prompt changes on the vendor side, especially when new tools are introduced and older ones get deprioritized.
Reasoning depth drops. Tasks that used to get multi-step plans now get one-shot answers. This can be the result of a routing change, a default "effort" knob being lowered, or a quiet switch to a distilled variant of the model. Claude Code and similar products have silently shipped changes that lower the default effort level.
Capacity throttling kicks in. During peak hours your requests get queued or slowed down, and the model itself might be swapped to a faster variant under load. This is the textbook case of "the tool is worse on weekdays at 2 PM Pacific and fine at 3 AM." It is a real phenomenon, not a hallucination.
None of these are bad faith on the vendor side. They are all reasonable operational decisions. They also all affect your experience. You are within your rights to notice and respond.
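The first of those shapes, earlier context compression, is one you can probe directly: plant a marker early in a synthetic conversation, pad it with filler turns, and check whether the model can still recall the marker. A minimal sketch, assuming the common chat-completions message shape; `build_retention_probe`, `recalled`, and the filler content are illustrative, not any vendor's API:

```python
# Crude probe for active context retention: plant a code word early,
# pad the transcript with filler turns, then ask for it back.
# Rerun weekly with increasing filler_turns to see where recall breaks.

def build_retention_probe(secret: str, filler_turns: int = 30) -> list[dict]:
    messages = [{"role": "user",
                 "content": f"Remember this code word: {secret}."}]
    for i in range(filler_turns):
        messages.append({"role": "user",
                         "content": f"Filler question {i}: what is {i} + {i}?"})
        messages.append({"role": "assistant", "content": str(i + i)})
    messages.append({"role": "user", "content": "What was the code word?"})
    return messages

def recalled(reply: str, secret: str) -> bool:
    # Case-insensitive check on the model's final answer.
    return secret.lower() in reply.lower()
```

If the code word survives 30 filler turns one week and only 10 the next, the active retention window moved, whatever the advertised context size says.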
3. The Vendor Communication Gap
There is a structural reason AI tool vendors communicate badly about regressions. Saying "we made the model faster and cheaper at the cost of some quality" is a terrible launch post. Saying "we are seeing elevated load and optimizing accordingly" is vague enough to sound harmless. The incentive is to say as little as possible until someone makes noise, then say the minimum.
This is not unique to AI. Every cloud service has published an incident report that underspecifies what actually happened during an outage. What is different with AI tools is that the product is largely a black box to the customer. You cannot instrument it the way you instrument a database or an HTTP API. You cannot see which model variant is serving your request. You usually cannot get a diff of the system prompt that was in use last week versus today.
Practical implication: the official status page is a lagging and partial indicator. Community channels (Discord, Reddit, Twitter) catch changes earlier. Your own eval harness catches them earliest. If a tool is central to how you work, you want all three.
4. A Simple Personal Eval Harness
You do not need an academic benchmark to detect regressions. You need five to ten prompts that represent the work you actually care about, run on a fixed cadence, with the outputs saved. Run the same set once a week and diff against last week.
Include tasks the tool usually nails (so you notice when it stops nailing them), tasks that are borderline (so you can see the edge move), and a small number of known-hard tasks (so you can see if headroom improves). Use tasks with objective success criteria where possible: a test that passes or fails, a file that matches an expected hash, a refactor that compiles.
Keep the prompts dumb and simple. The moment you start tweaking prompt wording to "fix" a regression you have lost the experimental control. The point of the eval is to catch changes in the tool, not to optimize your own prompts.
Run the eval on a stable machine at a stable time. The tool itself might behave differently at different hours. If you always evaluate at 10 PM local time, at least you are comparing apples to apples.
Log the model version or variant if the vendor exposes it, and log the token counts. Many regressions show up first as "completes the task but uses 3x more tokens" before they show up as "fails the task."
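With pass/fail status and token counts in each weekly log, the week-over-week comparison is a small function. A sketch, assuming each log is a list of records with `name`, `passed`, and `tokens` fields (the field names are illustrative):

```python
def diff_evals(last_week, this_week, token_ratio_alert=2.0):
    """Flag regressions between two weekly eval logs.

    Each log is a list of dicts with 'name', 'passed', and 'tokens' keys.
    """
    flags = []
    prev = {r["name"]: r for r in last_week}
    for r in this_week:
        old = prev.get(r["name"])
        if old is None:
            continue  # new case, nothing to compare against
        if old["passed"] and not r["passed"]:
            flags.append(f"{r['name']}: passed last week, fails now")
        # Token blowups often precede outright failures.
        if old["tokens"] and r["tokens"] / old["tokens"] >= token_ratio_alert:
            flags.append(
                f"{r['name']}: token usage up {r['tokens'] / old['tokens']:.1f}x")
    return flags
```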
5. Redundancy: Multi-Model Fallback and Desktop Agents
Once you have a way to detect regressions, the next step is making your workflow resilient to them. This is mostly about having more than one path for any given task.
At the model layer, keep accounts with at least two providers and know how to switch between them. Anthropic, OpenAI, Google, and the open-weight models served through Together, Groq, or Fireworks are all viable. Tools like OpenRouter, litellm, and Aider make it easy to point at a different backend in 30 seconds. You do not need to use all of them daily. You need the muscle memory to switch when one vendor has a bad week.
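The switching logic itself is trivial once each provider sits behind a common call signature. A provider-agnostic sketch; the backend callables here are stand-ins, and in practice each would wrap litellm, OpenRouter, or a vendor SDK:

```python
def complete_with_fallback(prompt, backends):
    """Try each backend in order; return (backend_name, output) from the
    first one that succeeds.

    backends: ordered list of (name, callable) pairs, primary first.
    """
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Keeping the returned backend name in your logs also tells you, after the fact, how often your primary provider was actually unavailable.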
At the workflow layer, recognize that AI coding tools are not the whole workflow. They write code. They do not fill out vendor portals, move files between apps, update CRM records, or run the manual QA pass across three browser tabs. Those steps are where most of a working day disappears, and they are immune to Claude Code having a bad morning.
A desktop agent covers the gap between "the code is written" and "the task is done." Tools like Fazm, UI.Vision, or a carefully written set of Keyboard Maestro macros can handle the repetitive GUI work. When your coding tool is degraded, the ability to keep the rest of the pipeline running is what keeps the day from turning into a write-off.
Desktop agents have their own reliability concerns, covered in other guides on this site. The short version: prefer the ones that use accessibility APIs over the ones that rely on vision and pixel matching. Fazm, for example, queries the macOS accessibility tree directly, which holds up across theme changes, font-size tweaks, and minor UI updates in the apps it controls.
6. Earning Trust Back
Tools go through bad weeks. That is not a reason to delete them. The question is whether the relationship is honest over a period of months. A vendor that eventually posts a clear note about what changed is one worth continuing to rely on. A vendor that denies any change ever happened, then quietly fixes it, is on thin ice. A vendor that keeps shipping improvements and is clear about trade-offs is the one you invest in long-term.
Your side of the relationship is measuring. If you cannot tell whether a tool improved or got worse across a quarter, you cannot make an informed decision about where to put your workflow weight. Thirty minutes a week on a small eval harness is cheap insurance.
The broader lesson from every "is the AI getting worse?" cycle is the same: do not build a workflow that depends on a single AI tool being at peak performance every day. Build one that assumes any given tool might be degraded on any given day, and that has at least one fallback path for the critical work. Multi-model fallback at the coding layer, a reliable desktop agent for the machine-control layer, and enough personal eval data to tell real regressions from vibes. That is what survives the next bad week.