AI Agent Observability: How to Know What Your Agent Is Actually Doing
AI agent observability is what closes the gap between "the agent returned an answer" and "the agent did the right thing for the right reason." Most production teams can confirm the first. Very few can confirm the second, and that is where incidents happen.
This article is a practitioner's guide to observability for production AI agents. Not a tour of observability tools. Not a definition of tracing concepts. A working model for what to instrument, what to measure, and what to do when your agent behaves unexpectedly in production.
---
Why Standard Monitoring Doesn't Work for Agents
Traditional APM tells you when a service is up, how long requests take, and what error rates look like. For a deterministic service — a database query, an API call, a validation function — that's enough. You know what success looks like, so you can detect failure.
Agents break this model in three ways:
**1. The same input doesn't produce the same output.** LLMs are stochastic. The same prompt, same tools, same user query can produce meaningfully different outputs on consecutive runs. Uptime monitoring won't catch this.
**2. Failure looks like success.** An agent can return a well-formed response that is confidently wrong. No exception is thrown. No 500 status code. The response looks fine — it's just factually incorrect, scope-violating, or subtly hallucinated. Traditional monitoring passes it through.
**3. The failure point isn't the output.** In a multi-step agent, the wrong tool call in step 2 produces a plausible-looking answer in step 6. The output doesn't expose the root cause. You need step-level visibility.
These three properties mean you need a different observability model.
---
The Four Layers of Agent Observability
Think of agent observability as four layers, each catching a different class of failure.
### Layer 1: Execution Tracing
Execution tracing captures every step the agent took — not just the final output, but each LLM call, tool invocation, retrieval query, and decision branch, with timing and cost attached.
What you need in each trace record:
- **Step type**: LLM call, tool call, retrieval, planning, handoff
- **Input**: The exact prompt, query, or parameters passed at this step
- **Output**: The raw response, not parsed and not filtered
- **Latency**: Wall-clock time for this step
- **Token usage and cost**: Per-LLM-call, not just session-total
- **Model version**: Critical for detecting output drift when upstream model providers push updates
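As a rough illustration, the fields above could be captured in a record like the one below. This is a minimal sketch, not a standard schema: the names `TraceRecord`, `StepType`, `session_id`, and `step_index` are assumptions made for the example, and the split into prompt and completion tokens is one possible way to store token usage.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Any, Optional


class StepType(str, Enum):
    LLM_CALL = "llm_call"
    TOOL_CALL = "tool_call"
    RETRIEVAL = "retrieval"
    PLANNING = "planning"
    HANDOFF = "handoff"


@dataclass
class TraceRecord:
    """One step of one agent session (hypothetical schema)."""

    session_id: str          # assumed: groups steps into a session
    step_index: int          # assumed: position of the step in the run
    step_type: StepType
    input: Any               # exact prompt, query, or parameters
    output: Any              # raw response, before any parsing
    latency_ms: float        # wall-clock time for this step
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost_usd: float = 0.0    # per step, not per session
    model_version: Optional[str] = None  # None for non-LLM steps

    def to_json(self) -> str:
        """Serialize for storage; non-JSON values fall back to str()."""
        payload = asdict(self)
        payload["step_type"] = self.step_type.value
        return json.dumps(payload, default=str)


record = TraceRecord(
    session_id="s1",
    step_index=0,
    step_type=StepType.TOOL_CALL,
    input={"order_id": "A17"},
    output='{"status": "shipped"}',
    latency_ms=12.5,
)
print(record.to_json())
```

Defining the record first makes the later layers easier: scoring, drift detection, and cost attribution can all read from the same fields.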
A trace without per-step records is a log. A trace *with* per-step records is visibility. The difference is whether you can debug a failure without reproducing it.
Implementation note: instrument at the agent framework level, not the output level. In LangGraph, LlamaIndex, or custom agent loops, attach trace spans at each node or step, not after the final response returns.
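For a custom agent loop, step-level instrumentation can be as small as a context manager wrapped around each step. The sketch below is framework-agnostic and uses only the standard library; `Tracer`, `step`, and `lookup_order` are hypothetical names, and a real setup would export spans to a trace backend instead of keeping them in a list.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List


class Tracer:
    """Collects one span per agent step in memory (illustrative only)."""

    def __init__(self) -> None:
        self.spans: List[Dict[str, Any]] = []

    @contextmanager
    def step(self, step_type: str, name: str, step_input: Any) -> Iterator[Dict[str, Any]]:
        span: Dict[str, Any] = {
            "step_type": step_type,
            "name": name,
            "input": step_input,
            "output": None,
            "error": None,
        }
        start = time.perf_counter()
        try:
            yield span  # the caller stores the raw output on the span
        except Exception as exc:
            span["error"] = repr(exc)  # failed steps are recorded too
            raise
        finally:
            span["latency_ms"] = (time.perf_counter() - start) * 1000.0
            self.spans.append(span)


def lookup_order(order_id: str) -> dict:
    """Hypothetical tool used only for this example."""
    return {"order_id": order_id, "status": "shipped"}


tracer = Tracer()
with tracer.step("tool_call", "lookup_order", {"order_id": "A17"}) as span:
    span["output"] = lookup_order("A17")
```

Because the span is appended in `finally`, a step that raises still leaves a record with its input, error, and latency, which is usually the record you need most.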
### Layer 2: Quality Scoring
Quality scoring moves you from "did the agent respond?" to "was the response correct?"
Three methods, each suited to a different kind of failure:
**Deterministic checks** — rules-based validation you can run without an LLM. Useful for format compliance, length bounds, citation presence, scope violations (e.g., agent answered a question outside its defined domain). Fast, cheap, consistent.
**LLM-as-judge** — a second LLM evaluates the agent's output against a rubric: factual accuracy, coherence, relevance, instruction-following. More expensive per evaluation, but catches semantic failures deterministic checks miss. Best applied on a sample, not every call.
**Human scoring + feedback loops** — direct feedback from end users or internal reviewers, mapped back to the traces that generated the outputs. The highest-quality signal, but sparse. Use to calibrate your LLM-as-judge rubrics over time.
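The first two methods can be sketched in a few lines. Everything specific here is an assumption for illustration: the length limit, the `[1]`-style citation pattern, the out-of-scope terms, the 1 to 5 scale, and the prompt wording. The judge takes a `call_llm` function as a parameter so that no particular model client is implied.

```python
import json
import re
from typing import Callable, Dict, List

MAX_CHARS = 1200                                   # assumed length bound
CITATION_PATTERN = re.compile(r"\[\d+\]")          # assumes citations like [1]
OUT_OF_SCOPE_TERMS = ("legal advice", "medical diagnosis")  # hypothetical
RUBRIC = ("factual_accuracy", "coherence", "relevance", "instruction_following")

JUDGE_PROMPT = """You are grading an AI agent's answer.
Score each criterion from 1 (poor) to 5 (excellent).
Reply with JSON only, using exactly these keys: {keys}.

Question: {question}
Answer: {answer}
"""


def deterministic_checks(response: str) -> List[str]:
    """Return the names of failed rule-based checks (empty list = pass)."""
    failures = []
    if not response.strip():
        failures.append("empty_response")
    if len(response) > MAX_CHARS:
        failures.append("too_long")
    if not CITATION_PATTERN.search(response):
        failures.append("missing_citation")
    lowered = response.lower()
    if any(term in lowered for term in OUT_OF_SCOPE_TERMS):
        failures.append("scope_violation")
    return failures


def judge(question: str, answer: str, call_llm: Callable[[str], str]) -> Dict[str, int]:
    """Ask a second model to score the answer against the rubric."""
    prompt = JUDGE_PROMPT.format(
        keys=", ".join(RUBRIC), question=question, answer=answer
    )
    raw = call_llm(prompt)
    try:
        scores = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge returned non-JSON output: {raw!r}") from exc
    if not isinstance(scores, dict) or any(k not in scores for k in RUBRIC):
        raise ValueError(f"judge output is missing rubric keys: {raw!r}")
    return {k: int(scores[k]) for k in RUBRIC}
```

Treating malformed judge output as an error rather than a low score keeps judge failures from silently polluting your quality metrics.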
Run at minimum: deterministic checks on 100% of outputs, LLM-as-judge on 10–20%, and human review on flagged cases.
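One way to implement that split is to sample by hashing the trace ID, so the same trace always gets the same decision and the sample is reproducible. This is a sketch under assumptions: the 15% rate is simply a point inside the 10 to 20% range, `should_judge` and `plan_evaluation` are hypothetical names, and "flagged" is taken here to mean "failed at least one deterministic check".

```python
import hashlib
from typing import Dict, List

JUDGE_SAMPLE_RATE = 0.15  # assumed; any value in the 10-20% range works


def should_judge(trace_id: str, rate: float = JUDGE_SAMPLE_RATE) -> bool:
    """Deterministically select roughly `rate` of traces for LLM-as-judge."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def plan_evaluation(trace_id: str, deterministic_failures: List[str]) -> Dict[str, bool]:
    """Decide which scoring methods apply to one trace."""
    return {
        "deterministic": True,                          # always, 100% of outputs
        "llm_judge": should_judge(trace_id),            # sampled
        "human_review": bool(deterministic_failures),   # flagged cases only
    }
```

Hash-based sampling avoids a shared random state across workers and lets you re-derive later which traces were judged.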
### Layer 3: Behavioral Drift Detection
A production agent that worked well last week may not work well this week — without any change to your code. Upstream model updates, distribution shift in user queries, and prompt degradation all cause behavioral drift.
What to track over time:
- **Score distributions**: Are quality scores holding steady, drifting down, or becoming bimodal (some sessions great, some terrible)?
- **Tool call patterns**: Is the agent calling tools in the expected order? Is it over-relying on one tool? Skipping steps it shouldn't skip?
- **Refusal rates**: For agents with safety guardrails, is the refusal rate stable or increasing? Increasing refusals sometimes signal prompt injection attempts at scale.
- **Latency trends**: Not just average latency, but P95 and P99. An agent whose P95 was 2s and is now 8s has a structural problem, even if the median looks fine.
Set alerts on statistical thresholds, not fixed numbers. A quality score of 0.82 means nothing without context — but a quality score that's declined 12% week-over-week is actionable.
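A relative, week-over-week check might look like the sketch below. The function names and the 10% default are assumptions for illustration; `statistics.quantiles` is from the Python standard library (3.8+) and needs at least two data points on most versions.

```python
from statistics import mean, median, quantiles
from typing import Sequence


def relative_decline(previous: Sequence[float], current: Sequence[float]) -> float:
    """Fractional drop in the mean score from one window to the next."""
    prev_mean = mean(previous)
    if prev_mean == 0:
        raise ValueError("previous window has a zero mean; decline is undefined")
    return (prev_mean - mean(current)) / prev_mean


def p95(values: Sequence[float]) -> float:
    """95th percentile: the 95th of the 99 cut points for n=100."""
    return quantiles(values, n=100, method="inclusive")[94]


def drift_alert(previous: Sequence[float], current: Sequence[float],
                max_decline: float = 0.10) -> bool:
    """True when the mean score fell by more than `max_decline`."""
    return relative_decline(previous, current) > max_decline


# Quality scores: a drop from 0.80 to 0.70 is a 12.5% relative decline.
last_week = [0.80, 0.82, 0.78]
this_week = [0.70, 0.71, 0.69]
print(drift_alert(last_week, this_week))

# Latency in seconds: the median barely moves while the tail gets much worse.
before = [1.0] * 90 + [2.0] * 10
after = [1.0] * 90 + [8.0] * 10
print(median(before), median(after), p95(before), p95(after))
```

The latency example is why the tail percentiles matter: both windows have the same median, and only P95 shows the regression.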
### Layer 4: Cost Attribution
Agentic workflows are expensive in unpredictable ways. An agent that autonomously decides to make three additional LLM calls to verify its answer can cost 4x what you budgeted per session.
Track cost at:
- **Per-session**: Total token spend for a single user interaction
- **Per-step**: Which step in the agent's chain is driving cost
- **Per-model**: If your agent uses a mix of models (large for planning, small for extraction), cost per model type
- **Per-task-type**: Different query types have different cost profiles; knowing which are expensive informs both product decisions and prompt optimization
Set per-session cost ceilings. An agent that can spend unbounded tokens on a single session is an operational risk, not a feature.
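A ceiling is easiest to enforce where cost is recorded. The sketch below tracks spend per session, per step, and per model, and raises once the session crosses its limit. `SessionBudget`, `CostCeilingExceeded`, the model labels, and the dollar figures are all hypothetical; how you compute `cost_usd` from token counts depends on your provider's pricing.

```python
from collections import defaultdict
from typing import DefaultDict


class CostCeilingExceeded(RuntimeError):
    """Raised when a session spends more than its configured ceiling."""


class SessionBudget:
    def __init__(self, ceiling_usd: float) -> None:
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0
        self.by_step: DefaultDict[str, float] = defaultdict(float)
        self.by_model: DefaultDict[str, float] = defaultdict(float)

    def record(self, step_name: str, model: str, cost_usd: float) -> None:
        """Attribute the cost, then enforce the ceiling."""
        self.spent_usd += cost_usd
        self.by_step[step_name] += cost_usd
        self.by_model[model] += cost_usd
        if self.spent_usd > self.ceiling_usd:
            raise CostCeilingExceeded(
                f"session spent ${self.spent_usd:.4f}, "
                f"ceiling is ${self.ceiling_usd:.4f}"
            )


budget = SessionBudget(ceiling_usd=0.10)
budget.record("plan", "large-model", 0.06)
budget.record("extract", "small-model", 0.01)
try:
    budget.record("verify", "large-model", 0.05)  # pushes the total to 0.12
except CostCeilingExceeded as exc:
    print(f"stopping agent loop: {exc}")
```

The agent loop decides what to do on `CostCeilingExceeded`: return a partial answer, hand off to a human, or fail the session. The cost that tripped the ceiling is still attributed, so the trace shows which step caused the overrun.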
---
What to Alert On (and What Not To)
One of the most common observability mistakes is alerting on everything and learning to ignore the alerts.
**Alert on:**
- Quality score declining more than 10% over 7 days (drift signal)
- P95 latency exceeding threshold (structural problem, not noise)
- Per-session cost exceeding 2x the median (runaway agent)
- Tool call failure rate above baseline (external dependency issue)
- Refusal rate spike (potential prompt injection or misuse)
**Don't alert on:**
- Individual low-quality sessions (noise; investigate trends, not outliers)
- Minor latency variance within normal range
- Every tool call (too much noise drowns out the real issues)
The goal of alerting is to surface conditions your team needs to act on — not to document everything the agent did.
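The "alert on" conditions can be written as one function over a periodic metrics snapshot. This is a sketch with assumed names (`MetricsSnapshot`, `Baselines`, `evaluate_alerts`) and assumed inputs: the text does not define what counts as a refusal "spike", so the 2x factor here is a placeholder to tune, and the baselines are whatever you measured before launch.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Sequence


@dataclass
class MetricsSnapshot:
    quality_7d_ago: float
    quality_now: float
    p95_latency_s: float
    session_costs_usd: Sequence[float]
    tool_failure_rate: float
    refusal_rate: float


@dataclass
class Baselines:
    p95_latency_limit_s: float
    tool_failure_rate: float
    refusal_rate: float
    refusal_spike_factor: float = 2.0  # assumed definition of a "spike"


def evaluate_alerts(m: MetricsSnapshot, b: Baselines) -> List[str]:
    """Return the names of alert conditions that currently hold."""
    alerts = []
    if m.quality_7d_ago > 0:
        decline = (m.quality_7d_ago - m.quality_now) / m.quality_7d_ago
        if decline > 0.10:
            alerts.append("quality_drift")
    if m.p95_latency_s > b.p95_latency_limit_s:
        alerts.append("p95_latency")
    if m.session_costs_usd:
        typical = median(m.session_costs_usd)
        if any(cost > 2 * typical for cost in m.session_costs_usd):
            alerts.append("runaway_session_cost")
    if m.tool_failure_rate > b.tool_failure_rate:
        alerts.append("tool_failures")
    if m.refusal_rate > b.refusal_spike_factor * b.refusal_rate:
        alerts.append("refusal_spike")
    return alerts


baselines = Baselines(p95_latency_limit_s=5.0, tool_failure_rate=0.02, refusal_rate=0.02)
snapshot = MetricsSnapshot(
    quality_7d_ago=0.85,
    quality_now=0.74,
    p95_latency_s=3.0,
    session_costs_usd=[0.02, 0.03, 0.02, 0.09],
    tool_failure_rate=0.01,
    refusal_rate=0.03,
)
print(evaluate_alerts(snapshot, baselines))
```

Every rule works on aggregates, which matches the "don't alert on" list: no single session or tool call can fire an alert by itself, apart from a session whose cost is far outside the typical range.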
---
Integrating Observability Into the Build Process
Production observability isn't a post-launch addition. If you instrument after you ship, you'll spend the first month flying blind.
**During development:**
- Build your trace schema before writing agent logic. Define what a trace record looks like, what fields are required, how costs are attached.
- Create a small eval dataset: 20 to 50 real or realistic queries with known expected outputs. Run your agent against this set on every change.
- Instrument from the first commit. Retrofitting tracing is painful and often incomplete.
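The eval dataset does not need infrastructure to start. A minimal sketch, assuming the simplest possible notion of "known expected output" (the answer must contain a given phrase): `EvalCase`, `pass_rate`, the two cases, and `stub_agent` are all hypothetical, and a real dataset would use your own queries and a check that fits your task.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalCase:
    query: str
    must_contain: str  # simplest stand-in for a "known expected output"


def pass_rate(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases whose answer contains the expected text."""
    if not cases:
        raise ValueError("eval dataset is empty")
    passed = sum(
        1 for case in cases
        if case.must_contain.lower() in agent(case.query).lower()
    )
    return passed / len(cases)


# Hypothetical cases; a real set would hold 20 to 50 queries from your domain.
CASES = [
    EvalCase("What is the refund window?", "30 days"),
    EvalCase("Which plan includes SSO?", "enterprise"),
]


def stub_agent(query: str) -> str:
    """Placeholder agent so the sketch runs end to end."""
    answers = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Which plan includes SSO?": "SSO is part of the Enterprise plan.",
    }
    return answers.get(query, "I don't know.")


print(f"pass rate: {pass_rate(stub_agent, CASES):.0%}")
```

Running this on every change and comparing the pass rate against the last recorded value gives you a regression signal long before production traffic does.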
**Before launch:**
- Define your quality rubric in writing. What does a "good" response mean for this agent's specific task? Encode this in your LLM-as-judge prompt.
- Set your cost ceiling per session. Know what a runaway conversation looks like before you see one in production.
- Run your eval dataset against your quality scoring pipeline. Establish your baseline before users arrive.
**After launch:**
- Review trace samples daily for the first two weeks and weekly after that. Read actual traces, not just alerts. You'll find failure modes no alert definition anticipated.
- Feed flagged cases back into your eval dataset. Production failures are your best test cases.
- Re-run evals after every significant prompt change or model update. Don't assume backward compatibility.
---
A Note on Tooling
There is good observability tooling available — Langfuse, LangSmith, Arize, MLflow, Helicone, and others. They differ in deployment model (SaaS vs. self-hosted), framework compatibility, and how they handle multi-step agent traces.
What they all share: they are infrastructure for observability, not a substitute for designing your observability layer. The quality rubric, the alerting thresholds, the eval dataset — those are your responsibility, and they require understanding your agent's task deeply.
A tool that stores traces is not the same as an observability practice. Use the tools. Also build the practice.
For teams building production agents on custom stacks or with strict data residency requirements, self-hosted observability (Langfuse's Docker deployment, or OpenTelemetry into your own collector) is often the right path. You own the traces. You define the retention. You don't send conversation data to a third party.
---
How Ashtayah Labs Approaches This
When we build production AI agents for clients — whether that's a document extraction agent for a BFSI firm or a multi-step workflow agent for operations teams — observability is part of the architecture specification, not the post-launch checklist.
We define the trace schema before the agent schema. We build the eval dataset from real examples before we build the agent. We set cost ceilings and quality thresholds in writing, and we instrument from the first commit.
The result is agents that can be debugged, monitored, and improved systematically — not agents that work well in demos and unpredictably in production.
That's the difference between a prototype and a system.
---
FAQ
**What's the minimum viable observability setup for a new production agent?** Execution tracing with per-step records, deterministic output checks on 100% of responses, and per-session cost tracking. That's the floor. Anything less and you won't be able to debug failures.
**Should we build our own observability layer or use a tool like Langfuse?** For most teams, start with a self-hosted Langfuse instance or equivalent. It handles trace storage, visualization, and basic scoring infrastructure. What you can't outsource: your quality rubric, your eval dataset, your alerting thresholds. Those require domain knowledge about your specific agent.
**How do you handle observability when the agent calls external APIs?** Instrument the API call as a trace step — input, output, latency, success/failure. Treat it the same as an LLM call. External API failures are often root causes in multi-step agent failures, and you won't find them without step-level traces.
**How often should we review traces manually?** Daily for the first two weeks post-launch, weekly thereafter. Automated quality scoring and alerts will surface statistical patterns — but manual trace review catches failure modes your alert definitions don't anticipate.
**Can we add observability to an agent that's already in production?** Yes, but expect gaps. Retrofitting trace instrumentation to an existing agent often requires refactoring the control flow to support span injection. It's doable — it's just slower and messier than building it in from the start.
---
Start a system review at [ashtayahlabs.com](https://ashtayahlabs.com) — if you're building production AI agents and want a second set of eyes on your observability architecture.
Ashtayah Labs
AI Systems Team