AI Agent Performance Monitoring with OpenTelemetry

Written by Dennis | May 28, 2026 7:00:00 AM

AI agent performance monitoring is the discipline of measuring how well an autonomous agent actually completes its work — not just whether it returned a 200, but how long it spent reasoning, how many tool calls it made, how often those tool calls failed, and how much that whole conversation cost. Traditional APM gives you none of those. This post shows what proper AI agent observability looks like in the AgentMon platform, built on top of the OpenTelemetry standard.

A slow service is a familiar problem. APM has been catching that one since 2010. You look at the trace, you find the slow span, you index a column, you go home.

A slow agent is a different problem.

An agent that took 11.5 seconds at P99 to do its job did not necessarily spend that time on a single slow API call. It might have spent it reasoning. It might have spent it backtracking through a plan that didn’t work. It might have spent the first nine seconds calling a tool that failed 32% of the time and the last two seconds finally getting an answer it could use.

If you’ve ever debugged that, you know it lives in a different place than “the database is slow.” This is part three of our four-part series on AgentMon, and we’re going to talk about the unique performance signatures of AI agents and how to actually find them. The numbers in this post are pulled from the live amon-test environment so you can see the real shape of the data.

The four golden signals — but for AI agent performance monitoring

We are not against the four golden signals. We are saying they are not enough. For a stateless HTTP service, rate / errors / latency / saturation is most of what you need. For an agent, you need those plus three more:

Reasoning depth — how many steps the agent took before it stopped reasoning. A request that took 156 thought-and-tool steps is interesting as a request, regardless of how fast each step was.
Backtracks — how many times the agent abandoned a sub-goal and started over.
Tool error rate (per tool) — because a 32% error rate on a tool the agent calls every other step is a very different problem from a 32% error rate on a tool it almost never calls.
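
As a rough sketch of how these three signals could be computed, assume each session is available as a flat list of step records. The Step shape and the agent_signals helper below are invented for illustration and are not AgentMon’s schema or API; treating a backtrack as its own record kind is also an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    # Hypothetical record shape, invented for this sketch.
    kind: str                   # "thought", "tool", or "backtrack"
    tool: Optional[str] = None  # tool name when kind == "tool"
    ok: bool = True             # False when a tool call failed


def agent_signals(steps):
    """Compute reasoning depth, backtrack count, and per-tool error rate."""
    calls = Counter()
    failures = Counter()
    for step in steps:
        if step.kind == "tool":
            calls[step.tool] += 1
            if not step.ok:
                failures[step.tool] += 1
    return {
        # Depth counts thought and tool steps; backtracks are counted separately.
        "reasoning_depth": sum(1 for s in steps if s.kind in ("thought", "tool")),
        "backtracks": sum(1 for s in steps if s.kind == "backtrack"),
        "tool_error_rate": {tool: failures[tool] / n for tool, n in calls.items()},
    }


# Illustrative session, not real data.
session = [
    Step("thought"),
    Step("tool", "WebSearch"),
    Step("tool", "Edit", ok=False),
    Step("backtrack"),
    Step("thought"),
    Step("tool", "Edit"),
]
signals = agent_signals(session)
```

Keeping the error rate keyed by tool is the point of the third signal: the same percentage means something different depending on how often the tool is called.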

AgentMon’s Performance Lens collapses both worlds — the traditional and the agent-specific — into one strip:

A few things to notice here. The dollar figure is part of the golden signals. We made that choice on purpose. For an LLM-backed agent, cost is a first-class performance dimension — a P99 latency improvement that triples your token spend is not actually an improvement, and you should be able to see those two numbers next to each other instead of in different tabs.
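
To make the latency-versus-cost trade-off concrete, here is a minimal sketch that reports P99 latency and total spend side by side for two configurations. The request data is invented, the nearest-rank percentile is one common definition rather than necessarily the one AgentMon uses, and the function names are hypothetical.

```python
def p99(values):
    """Nearest-rank 99th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def summarize(requests):
    """requests: list of (duration_ms, cost_cents) pairs."""
    return {
        "p99_ms": p99([duration for duration, _ in requests]),
        "cost_cents": sum(cost for _, cost in requests),
    }


# Invented data: the candidate is about 20% faster but costs three times as much.
baseline = [(i * 100, 1) for i in range(1, 101)]
candidate = [(i * 80, 3) for i in range(1, 101)]

baseline_summary = summarize(baseline)
candidate_summary = summarize(candidate)
```

Read on latency alone, the candidate looks like a win; with the cost figure next to it, the comparison is much less clear-cut.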

The Slow Spans column on the right is the bridge to the next layer. Each row is a span that just exceeded P99 — tool_use at 800ms in this snapshot. Click any of them and you’re in the trace.

Distributed tracing for AI agents, without the assembly

Spans are nice. The reason most people don’t look at them is the operational tax: somebody has to keep an OTLP collector working, somebody has to keep the trace store from filling up, somebody has to index the right attributes so you can actually search. AgentMon is OpenTelemetry-native — we take OTLP, we don’t ask you for a proprietary SDK or agent code change — but we also do the not-glamorous work of turning that stream into a queryable surface:

The columns matter. Every row is one span. Every span carries the model it was using, the tool it was calling, the duration, the cost, and the status. You can filter by any of them. You can group by trace ID. You can ask “show me every chat span that took more than 2 seconds and cost more than 5 cents,” and you get an answer in the same screen instead of switching to a different product.
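
The kind of question quoted above can be written as a simple filter. This is only a sketch over an in-memory list: the Span fields and trace IDs are assumptions made for illustration, the duration and cost pairs are the chat-span figures quoted in this post, and AgentMon’s own query surface may look nothing like this.

```python
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical field names, chosen for this sketch.
    trace_id: str
    name: str
    duration_s: float
    cost_usd: float


spans = [
    Span("t1", "chat", 1.8, 0.013),
    Span("t1", "chat", 2.8, 0.018),
    Span("t2", "chat", 3.2, 0.052),
    Span("t2", "chat", 5.9, 0.011),
    Span("t3", "chat", 6.7, 0.040),
    Span("t3", "Bash", 0.4, 0.0),
]


def slow_and_costly(spans, name="chat", min_duration_s=2.0, min_cost_usd=0.05):
    """Spans of the given name that exceed both the duration and cost thresholds."""
    return [
        s for s in spans
        if s.name == name
        and s.duration_s > min_duration_s
        and s.cost_usd > min_cost_usd
    ]


matches = slow_and_costly(spans)
```

Only the 3.2s / $0.052 span clears both thresholds; the slower 5.9s and 6.7s spans drop out on cost.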

In the snapshot above, notice the cost column is mostly empty for the local-tool spans (NotebookEdit, Bash, Glob, Read) and populated for the chat spans (1.8s / $0.013, 2.8s / $0.018, 3.2s / $0.052, 5.9s / $0.011, 6.7s / $0.040). That asymmetry is a feature: cost attribution travels with the span that actually incurred the cost, and the parent reasoning that orchestrated those tool calls inherits it in roll-ups.
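
One way to picture the roll-up: each span keeps its own cost, and a parent’s total is its own cost plus everything beneath it. The sketch below assumes spans arrive with a parent ID; the span IDs and tree shape are invented, and the two chat costs are taken from the figures above.

```python
from collections import defaultdict

# (span_id, parent_id, name, own_cost_usd). IDs and tree shape are invented.
SPANS = [
    ("s0", None, "session", 0.0),
    ("s1", "s0", "reasoning", 0.0),
    ("s2", "s1", "chat", 0.013),
    ("s3", "s1", "Bash", 0.0),
    ("s4", "s1", "Read", 0.0),
    ("s5", "s1", "chat", 0.052),
]


def rolled_up_cost(spans):
    """Return {span_id: own cost plus the cost of all descendants}."""
    own = {}
    children = defaultdict(list)
    for span_id, parent_id, _name, cost in spans:
        own[span_id] = cost
        children[parent_id].append(span_id)

    totals = {}

    def total(span_id):
        # Memoized so each subtree is summed only once.
        if span_id not in totals:
            totals[span_id] = own[span_id] + sum(total(c) for c in children[span_id])
        return totals[span_id]

    for span_id in own:
        total(span_id)
    return totals


totals = rolled_up_cost(SPANS)
```

The local-tool spans stay at zero, while the reasoning span and the session above it both total roughly $0.065.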

The new layer: the Reasoning Explorer

Here is the part of AI agent performance monitoring you have not seen in any APM tool you’ve used before. This is the Reasoning Explorer:

Three leaderboards that nobody else gives you:

Longest reasoning chains answers “which agents are thinking too hard about what they’re doing?” 154 reasoning steps for initech/recruiting-svc is not a bug per se, but it is a strong signal that either (a) the task was actually too complex for one session and should have been decomposed, or (b) the agent didn’t have the tools it needed and was thrashing.

Most backtracks is the canary for stuck agents. A backtrack is when the model abandons a sub-goal and starts over on something else. Three of those in one session is fine. Three of those across every session for a given agent suggests the agent’s plan-of-record is broken.

Failed subgoals is the cleanest possible signal. The agent set out to do something. It told us, in its own intermediate reasoning, that it could not. We surface that here. Currently empty on amon-test, which is what you want.

Click one of those entries and you’re in the step explorer:

156 steps. The first thought took 19 minutes and 38 seconds. The first tool call was a WebSearch that ran in 164ms. Then a Write, a NotebookEdit, a Glob, a MultiEdit, an Edit, another WebSearch — you can see the agent feeling its way around the problem. That very first thought, the one that took nearly 20 minutes, is where the time went. It is not a slow database, it is not a slow tool — it is a model spending 20 minutes thinking through something before doing the first useful work. That diagnosis is invisible in a normal trace viewer.
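
The same diagnosis can be expressed as arithmetic: find the step with the largest share of total session time. In this sketch the first two durations come from the session described above (19 minutes 38 seconds is 1,178 seconds); the remaining durations are invented placeholders, and dominant_step is a hypothetical helper.

```python
# (step name, duration in seconds)
steps = [
    ("thought", 1178.0),    # 19 min 38 s, from the session above
    ("WebSearch", 0.164),   # from the session above
    ("Write", 0.7),         # invented
    ("NotebookEdit", 0.3),  # invented
    ("Glob", 0.05),         # invented
]


def dominant_step(steps):
    """Return (index, name, share of total time) for the longest step."""
    total = sum(duration for _, duration in steps)
    index, (name, duration) = max(enumerate(steps), key=lambda item: item[1][1])
    return index, name, duration / total


index, name, share = dominant_step(steps)
```

With these numbers the first thought accounts for more than 99% of the elapsed time, which is why no amount of tool tuning would have helped this session.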

Tools and MCP: where the latency actually lives

Once you know the agent is fine but the tools are slow, the Tools & MCP page gives you the bird’s-eye view:

The headline numbers are unflattering on purpose. 96,759 tool calls in the window, 32.98% fleet-wide error rate, and the single most expensive tool is Edit at $182 in the period. The scatter plot is doing real work here: it positions each tool by its average latency on the x-axis and its error rate on the y-axis, with a 5% threshold line drawn in. Anything above that line is a tool you are calling that is actively failing too often.
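
The coordinates behind a scatter plot like this are straightforward to derive. The sketch below assumes tool calls are available as (tool, latency_ms, ok) tuples, which is an invented shape; the call data is illustrative, and only the 5% threshold comes from the description above.

```python
from collections import defaultdict

ERROR_THRESHOLD = 0.05  # the 5% line described above


def tool_scatter(calls):
    """calls: iterable of (tool, latency_ms, ok).

    Returns {tool: (average latency in ms, error rate)}.
    """
    latencies = defaultdict(list)
    errors = defaultdict(int)
    for tool, latency_ms, ok in calls:
        latencies[tool].append(latency_ms)
        if not ok:
            errors[tool] += 1
    return {
        tool: (sum(values) / len(values), errors[tool] / len(values))
        for tool, values in latencies.items()
    }


def above_threshold(points, threshold=ERROR_THRESHOLD):
    """Tools whose error rate sits above the threshold line, sorted by name."""
    return sorted(tool for tool, (_, rate) in points.items() if rate > threshold)


# Invented calls: Edit fails one time in four, Read never fails.
calls = [
    ("Edit", 400, True), ("Edit", 420, True), ("Edit", 380, False), ("Edit", 400, True),
    ("Read", 100, True), ("Read", 300, True),
]
points = tool_scatter(calls)
flagged = above_threshold(points)
```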

The Languages and Time-of-Day cards underneath are the same kind of signal but for fleet shape. If the agents are spending all their effort in Python at 3 a.m., that is a fact worth knowing — for capacity planning, for tool support tiers, and sometimes for security.

Analytics: AI agent observability for ad-hoc questions

For everything we’ve described so far, AgentMon takes a strong stance: most operators don’t want a query language, they want a dashboard. But sometimes you want a query language, and we ship one. The Analytics page is the search surface over every span, every tool call, every cost line, in the fleet:

Scroll one screen down and you get the breakdowns that performance investigations live and die on:

You can see the conversation a performance lead would have with this page. Sonnet is doing 70% of the work and accounts for 80% of the spend. Tool latencies are tightly clustered around 400ms, so no single tool is slow. The slow agents are not slow because of one bad dependency; they’re slow because they are calling tools dozens or hundreds of times in a single session. That diagnosis re-frames the engineering work: it is not “make Edit faster,” it is “make the agent call Edit fewer times.”
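
A small worked example shows why call count, not per-call latency, drives session time. The two sessions below are invented; both average about 400ms per tool call, in line with the clustering described above, and session_profile is a hypothetical helper.

```python
from statistics import mean


def session_profile(latencies_ms):
    """Summarize one session's tool calls: count, average latency, total time."""
    return {
        "calls": len(latencies_ms),
        "avg_ms": mean(latencies_ms),
        "total_s": sum(latencies_ms) / 1000,
    }


# Invented sessions with the same per-call latency but very different call counts.
short_session = [390, 400, 410, 400, 400]
long_session = [400] * 150

short_profile = session_profile(short_session)
long_profile = session_profile(long_session)
```

Both sessions look identical on a per-tool latency chart, yet one spends 2 seconds in tools and the other spends 60.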

What AI agent performance monitoring means in AgentMon

For every agent in the fleet, AgentMon answers: how long did it take, how many tool calls did it make, how much did it cost, how much did it think, and was any of that the actual reason it was slow?

Golden signals plus reasoning depth plus backtrack count plus per-tool latency and error breakdowns, all on an OpenTelemetry-native trace stream you don’t have to teach. That is what an APM for AI agents has to be.

The model is going to think for as long as you let it. Observability is how you decide what “as long as you let it” means.

Frequently asked questions about AI agent performance monitoring

What is AI agent performance monitoring?

AI agent performance monitoring is the practice of measuring how well an autonomous agent completes a task, end-to-end. That includes the four golden signals (rate, errors, latency, saturation) plus three signals unique to agents: reasoning depth, backtracks, and per-tool error rate. Without those last three, you can’t tell why an agent is slow.

How is AI agent performance monitoring different from traditional APM?

Traditional APM watches HTTP requests, database queries, and external service calls. Agents add a layer above that: the reasoning loop, where the model decides what to do next. A slow agent often isn’t slow because any one tool call is slow — it’s slow because it’s calling tools dozens or hundreds of times. APM cannot see that pattern.

Does AgentMon work with OpenTelemetry?

Yes — AgentMon is OpenTelemetry-native. You point your existing OTLP traffic at AgentMon’s collector and it ingests it directly. No proprietary SDK, no agent code changes, no vendor lock-in.

What’s a “reasoning chain” in AgentMon?

A reasoning chain is the sequence of THOUGHT spans and TOOL spans an agent emits during a single session. AgentMon’s Step Explorer renders the whole chain inline so you can see exactly where the agent spent its time — which step was a 19-minute thought, which was a 700ms file write, which was an unproductive backtrack.

How does AgentMon measure agent cost?

Cost attribution travels with each span. LLM-call spans carry their own token count and dollar figure, while local-tool spans (file reads, shell calls) have no LLM cost but still roll into the parent reasoning’s cost. The result: you can see total cost per session, per agent, per model, or per tool, all from the same trace data.

Next in the series: AI Agent Cost Monitoring: Track Spend Before It Spirals. The final post in the series — the one your finance lead is going to want to read first. Burn rate, top wasters, token-level chargebacks, oversized prompts, and the model decisions that are quietly costing you a fortune.

