How to Find the Conversations Where Your AI Agent Fails and Users Abandon
Your AI agent handles thousands of conversations a day. Aggregate metrics look fine: 90%+ task completion, decent CSAT scores, low error rates. But somewhere in that remaining 5-10%, users are hitting a wall, getting frustrated, and leaving. Those conversations are where your product reputation erodes. The problem is finding them.
Why Aggregate Metrics Hide Failures
A 95% success rate sounds impressive until you do the math. At 10,000 conversations per day, that is 500 frustrated users daily. Many of them will never come back. And the aggregate number tells you nothing about why they failed or which conversation patterns triggered the failure.
The core issue: most agent monitoring is built around error tracking (exceptions, API failures, timeouts). But the worst user experiences are not errors in the traditional sense. The agent responds, it generates valid output, it does not throw an exception. It just gives a wrong answer, loops in circles, or misunderstands what the user actually wanted.
| Failure type | Shows up in error logs? | User impact |
|---|---|---|
| API timeout / 500 error | Yes | Immediate, obvious |
| Malformed tool call | Yes | Workflow halts |
| Wrong answer, confidently stated | No | User trusts it, acts on bad info |
| Circular conversation (agent repeats itself) | No | User gives up after 3-4 turns |
| Misunderstood intent, correct format | No | User gets irrelevant help |
| Partial completion (agent does 3 of 5 steps) | No | User has to finish manually |
The bottom four rows are the dangerous ones. They are invisible to traditional monitoring.
Step 1: Define What Abandonment Actually Looks Like
Before you can detect failure, you need a working definition of "user abandoned." This varies by product, but common signals include:
- Session ended without resolution - the user closed the chat, navigated away, or stopped responding after the agent's last message
- Repeated rephrasing - the user asked the same question 3+ times in different words
- Escalation request - the user explicitly asked for a human ("let me talk to someone")
- Negative sentiment shift - early messages were neutral or positive, later messages became frustrated
- Short final message - the conversation ended with "nevermind", "forget it", "ok", or silence
You do not need all of these. Pick 2-3 that map to your product and instrument them.
```python
def classify_abandonment(conversation):
    """Score a conversation against the abandonment signals above."""
    signals = []

    # Signal 1: session ended without task completion
    if not conversation.task_completed and conversation.ended_by == "user":
        signals.append("unresolved_exit")

    # Signal 2: user rephrased the same intent 3+ times
    intents = [msg.detected_intent for msg in conversation.user_messages]
    for intent in set(intents):
        if intents.count(intent) >= 3:
            signals.append("repeated_rephrase")

    # Signal 3: explicit escalation request
    # ("agent" is a broad match; expect some false positives)
    escalation_phrases = ["talk to human", "real person", "agent", "supervisor"]
    for msg in conversation.user_messages:
        if any(phrase in msg.text.lower() for phrase in escalation_phrases):
            signals.append("escalation_request")
            break  # count the signal once, not once per matching message

    # Signal 4: conversation ended abruptly (< 5 chars in last message)
    if conversation.user_messages and len(conversation.user_messages[-1].text) < 5:
        signals.append("terse_exit")

    return {
        "is_abandonment": len(signals) >= 2,
        "signals": signals,
        "confidence": min(len(signals) / 3, 1.0),
    }
```
Step 2: Build the Conversation Funnel
Think of every conversation as a funnel with discrete stages, for example:

1. User states a goal
2. Agent correctly identifies the intent
3. Agent takes the right action (tool call, lookup, answer)
4. Task completes
5. User confirms or leaves satisfied

Users drop off at different stages for different reasons. If the biggest drop sits between stages 2 and 3, the agent understood what the user wanted but took the wrong action. That tells you exactly where to focus your investigation.
Log every conversation with stage markers so you can compute this funnel weekly.
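Computing that weekly funnel takes only a few lines over the logged metadata. A minimal sketch, assuming each conversation record carries the `funnel_stage_reached` field shown in Step 3:

```python
from collections import Counter

def compute_funnel(conversations, n_stages=5):
    """Count how many conversations reached each funnel stage.

    A conversation that reached stage 3 also passed stages 1 and 2,
    so each stage's count includes every conversation at or beyond it.
    """
    reached = Counter(c["funnel_stage_reached"] for c in conversations)
    return [
        (stage, sum(v for k, v in reached.items() if k >= stage))
        for stage in range(1, n_stages + 1)
    ]

def drop_off(funnel):
    """Percentage of conversations lost between consecutive stages."""
    drops = []
    for (s1, c1), (s2, c2) in zip(funnel, funnel[1:]):
        pct = round(100 * (c1 - c2) / c1, 1) if c1 else 0.0
        drops.append((s1, s2, pct))
    return drops
```

Run `drop_off(compute_funnel(last_week))` and the stage pair with the largest percentage is where to investigate first.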
Step 3: Instrument Conversation-Level Signals
For each conversation, log these signals alongside the messages:
```python
conversation_metadata = {
    "conversation_id": "conv_abc123",
    "started_at": "2026-04-08T14:22:00Z",
    "ended_at": "2026-04-08T14:25:33Z",
    "total_turns": 8,
    "user_turns": 4,
    "agent_turns": 4,
    "tool_calls": 3,
    "tool_failures": 1,
    "detected_intents": ["refund_request", "refund_request", "escalation"],
    "intent_changed": True,
    "resolution_status": "abandoned",
    "last_user_message": "forget it",
    "time_to_first_response_ms": 1200,
    "avg_response_time_ms": 2800,
    "sentiment_trajectory": [0.1, -0.2, -0.5, -0.8],
    "funnel_stage_reached": 2,
}
```
The key field is sentiment_trajectory. Even a simple rule (message length decreasing, question marks increasing, presence of frustration keywords) gives you a usable signal without needing a separate sentiment model.
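A minimal rule-based version of that trajectory might look like the sketch below. The keyword list and score weights are illustrative, not a tuned model:

```python
# Illustrative frustration cues; extend from your own transcripts.
FRUSTRATION_WORDS = {"nevermind", "forget", "useless", "wrong", "again", "still"}

def message_sentiment(text):
    """Crude per-message score in [-1, 1] from surface cues only."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip("!?.,") in FRUSTRATION_WORDS)
    score = -min(hits * 0.4, 1.0)
    if "?" in text and hits == 0:
        score += 0.1  # a question without frustration cues reads as engaged
    return round(score, 2)

def sentiment_trajectory(user_messages):
    """One score per user message, in conversation order."""
    return [message_sentiment(m) for m in user_messages]
```

A downward-sloping trajectory is the signal; the absolute values matter much less than the direction.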
Step 4: Build the Failure Detection Query
With this data logged, finding failed conversations becomes a database query, not a guessing game.
```sql
SELECT
    conversation_id,
    started_at,
    total_turns,
    resolution_status,
    last_user_message,
    funnel_stage_reached
FROM conversations
WHERE
    resolution_status IN ('abandoned', 'escalated')
    AND total_turns >= 3
    AND started_at > NOW() - INTERVAL '7 days'
ORDER BY total_turns DESC
LIMIT 100;
```
Sort by total_turns DESC to surface the conversations where users tried hardest before giving up. A user who sends 8 messages and then abandons was more invested than someone who bounced after one message. Those high-effort abandonments are your highest-priority failures.
Step 5: Cluster Failures by Root Cause
Once you have your 100 worst conversations, you need to categorize them. Manual review of 20-30 conversations usually reveals 3-5 failure clusters.
| Cluster | Example pattern | Typical fix |
|---|---|---|
| Intent misclassification | User asks about billing, agent responds about account settings | Improve intent detection, add training examples |
| Tool call failure cascade | First API call fails, agent retries wrong endpoint | Add fallback logic, better error recovery |
| Context window overflow | Long conversation causes agent to forget early context | Summarize earlier turns, use retrieval |
| Ambiguous user input | "Change my plan" could mean subscription, flight, or meal plan | Ask clarifying question before acting |
| Knowledge gap | Agent does not know about a recent policy change | Update knowledge base, add retrieval source |
Tip
Run this clustering weekly. The distribution shifts as you fix things. Last week's top cluster drops, and a new one emerges. That is normal and healthy. If the same cluster stays on top for 3+ weeks, your fixes are not landing.
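The "same cluster on top for 3+ weeks" check is easy to automate. A sketch, assuming you keep each week's cluster distribution as a dict of cluster name to count:

```python
def persistent_top_cluster(weekly_distributions, weeks=3):
    """Return the cluster name if one cluster has been the most common
    failure for the last `weeks` consecutive weeks, else None."""
    if len(weekly_distributions) < weeks:
        return None
    recent = weekly_distributions[-weeks:]
    tops = [max(dist, key=dist.get) for dist in recent]
    return tops[0] if len(set(tops)) == 1 else None
```

A non-None result is the "your fixes are not landing" alarm.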
Step 6: Automate the Triage Loop
Manual review does not scale past a few hundred conversations. Once you know your failure clusters, build automated classifiers for each one.
```python
def auto_triage(conversation):
    """Classify a failed conversation into a known failure cluster."""
    triggers = {
        "intent_misclassification": (
            conversation.intent_changed
            and conversation.funnel_stage_reached <= 2
        ),
        "tool_failure_cascade": (
            conversation.tool_failures >= 2
            and conversation.resolution_status == "abandoned"
        ),
        "context_overflow": (
            conversation.total_turns >= 10
            and conversation.funnel_stage_reached >= 3
            and conversation.resolution_status != "completed"
        ),
        "ambiguous_input": (
            len(conversation.detected_intents) >= 3
            and len(set(conversation.detected_intents)) >= 2
        ),
        "knowledge_gap": (
            conversation.agent_said_dont_know
            or conversation.hallucination_detected
        ),
    }
    matched = [k for k, v in triggers.items() if v]
    return matched[0] if matched else "uncategorized"
```
Pipe this into a dashboard or Slack alert. Every morning, your team sees: "Yesterday: 47 abandonments. 18 intent misclassification, 12 tool failures, 9 context overflow, 8 uncategorized."
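Producing that morning digest is a small aggregation step on top of `auto_triage`. A sketch (the actual Slack or dashboard delivery is left out):

```python
from collections import Counter

def triage_summary(failed_conversations, triage_fn):
    """Build a one-line morning digest from yesterday's failures."""
    counts = Counter(triage_fn(c) for c in failed_conversations)
    total = sum(counts.values())
    parts = [f"{n} {cluster}" for cluster, n in counts.most_common()]
    return f"Yesterday: {total} abandonments. " + ", ".join(parts) + "."
```

Pass `auto_triage` as `triage_fn` and schedule it on yesterday's failed conversations.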
The "uncategorized" bucket is where new failure modes hide. Review those manually each week.
Step 7: Set Up a Conversation Replay System
Numbers tell you what failed. Reading the actual conversation tells you why. Build a simple replay tool that lets you step through a failed conversation turn by turn.
What to show for each turn:

- The user message and the agent's response, in order
- The intent detected for that turn
- Any tool calls, with their arguments, results, and failures
- Response latency
- The running sentiment score, if you log one
This is not a complex system. A simple web page that pulls conversation data from your database and renders it chronologically with metadata annotations is enough. The point is making it easy enough that your team actually reads failed conversations instead of just looking at dashboards.
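Even a terminal version gets the team reading transcripts. A sketch of the replay idea, assuming a simple per-message schema (`role`, `text`, optional `detected_intent`, `tool_call`, `tool_result`) alongside the Step 3 metadata:

```python
def replay(conversation):
    """Render a failed conversation turn by turn with metadata annotations."""
    lines = [
        f"Conversation {conversation['conversation_id']} "
        f"({conversation['resolution_status']})"
    ]
    for i, msg in enumerate(conversation["messages"], start=1):
        lines.append(f"[{i}] {msg['role'].upper()}: {msg['text']}")
        if msg.get("detected_intent"):
            lines.append(f"    intent={msg['detected_intent']}")
        if msg.get("tool_call"):
            lines.append(f"    tool={msg['tool_call']} -> {msg.get('tool_result', '?')}")
    return "\n".join(lines)
```

Print this for each conversation surfaced by the Step 4 query and the "why" is usually obvious within a few turns.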
Common Pitfalls
- Optimizing for average metrics instead of tail failures. Your 95th percentile experience is what users remember and tweet about. A mean CSAT of 4.2 means nothing if 8% of users had a 1-star experience.
- Only logging errors, not conversations. If you only log when something throws an exception, you miss the 80% of failures that look like normal responses. Log every turn of every conversation. Storage is cheap. Finding the failure after the user complains is expensive.
- Using completion rate as your only success metric. An agent that marks a task "complete" after doing the wrong thing has a 100% completion rate and 0% actual success rate. You need user confirmation signals, not just agent self-reports.
- Building a complex ML pipeline before logging the basics. You do not need a fine-tuned failure classifier on day one. You need conversation logs with timestamps, turn counts, and resolution status. Start with SQL queries and manual review. Add ML later when you know what patterns to detect.
Warning
Do not use the agent itself to evaluate whether it failed. Self-evaluation has the same blind spots as the original failure. Use external signals: user behavior, session duration, explicit feedback, and human review.
A Minimal Starting Checklist
If you are starting from zero, here is the minimum viable instrumentation:
- Log every conversation with a unique ID, timestamps, turn count, and how it ended (user closed, task completed, escalated, timed out)
- Track resolution status as a first-class field, not something you infer later
- Run a weekly query for conversations with 5+ turns and no resolution. Read 20 of them.
- Count repeated intents per conversation. If the user asks the same thing three times, that conversation failed.
- Build one dashboard showing abandonment rate over time, broken down by the entry intent
That is five things. You can build all of them in a day. The hard part is not the engineering. It is making time to actually read the failed conversations and act on what you find.
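The dashboard item in the checklist above can be sketched over the logged metadata. Field names follow Step 3; this assumes conversation records are dicts:

```python
from collections import defaultdict

def abandonment_rate_by_entry_intent(conversations):
    """Abandonment rate broken down by the first detected intent."""
    totals = defaultdict(int)
    abandoned = defaultdict(int)
    for c in conversations:
        intents = c.get("detected_intents") or ["unknown"]
        entry = intents[0]
        totals[entry] += 1
        if c["resolution_status"] == "abandoned":
            abandoned[entry] += 1
    return {
        intent: round(abandoned[intent] / totals[intent], 3)
        for intent in totals
    }
```

Chart this weekly and the entry intents with the worst rates are where the next fixes go.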
Wrapping Up
Finding the conversations where your AI agent fails is not about building sophisticated ML systems. It is about logging the right signals, querying for abandonment patterns, and then reading the actual conversations. The 5% failure rate that looks acceptable in aggregate represents real users who left frustrated. Instrument the funnel, cluster the failures, read the transcripts, and fix them one pattern at a time.
Fazm is an open source macOS AI agent, available on GitHub.