Your observability stack probably looks healthy. Metrics are flowing, traces are connected, dashboards are green. And yet, if someone asked you exactly what your AI agent did in the last thirty minutes — which files it read, which APIs it called, what reasoning led it to take a particular action — you probably couldn't answer.
That's the observability gap. It's not a misconfigured collector or a missing span. It's a structural mismatch between how observability tooling was designed and how AI agents actually behave.
Understanding where the gaps are is the first step to closing them.
## The Core Problem: Agents Don't Behave Like Services
Traditional observability was designed around services. A service receives a request, does work, returns a response. Traces capture that flow. Metrics capture its frequency and latency. Logs capture the errors. The model works because services are predictable units: defined inputs, defined outputs, defined boundaries.
AI agents break every one of those assumptions. They don't wait for requests — they initiate actions. They don't have fixed boundaries — they operate across whatever systems they have access to. They don't produce simple outputs — they make decisions, take multi-step actions, and update their behavior based on intermediate results.
An agent asked to "set up the staging environment" might read a config file, call a cloud provider API, modify a database record, write a log, send a Slack message, and start a background process — all in under two seconds, and all in response to a single prompt. The trace for that looks like six unrelated service calls. The actual chain of causation — prompt → reasoning → decision → execution — is invisible.
## Three Visibility Gaps
### 1. The shell and command layer
Most agents interact with systems through shell commands. They run scripts, execute binaries, pipe output between processes. This layer is almost entirely invisible to standard observability tooling.
Application traces don't capture shell execution. Log aggregators see the output of commands only if the command explicitly writes to stdout or a log file. APM agents instrument code — not shell invocations. The result: a significant fraction of what an agent actually does happens in a layer that observability tooling doesn't reach.
This matters most in security incidents. If an agent runs a command that exfiltrates data, modifies system configuration, or installs a package with a vulnerability, that event often has no trace in any monitoring system. You'd know something changed, but not what or why.
### 2. Cross-system correlation
An agent that touches five systems in a single task generates five separate event streams, with no native correlation between them. Your cloud provider logs show an API call. Your database logs show a write. Your application logs show a config reload. Each event looks normal in isolation. None of them are labeled "this happened because the agent was executing task X."
This is the fundamental cross-system correlation problem. Standard distributed tracing solves it for services — a trace ID propagates through RPC calls and ties events together. But agent actions don't flow through RPC calls. They flow through shell commands, API clients, direct database connections, and arbitrary tool integrations. Trace ID propagation doesn't reach most of them.
Without cross-system correlation, you can reconstruct what happened in each system independently — but you can't reconstruct the causal chain across systems. That's precisely what you need during an incident or an audit.
### 3. In-context reasoning
The most opaque layer is also the most consequential: the agent's reasoning. When an agent takes an unexpected action, the question isn't just "what did it do" — it's "why did it do that." What information did it have? What was it trying to achieve? What alternatives did it consider and reject?
Standard observability has no answer for this. Traces capture execution. Logs capture events. Neither captures reasoning. The agent's decision-making process lives entirely in the model's context window — and when the task completes, that context is gone.
This matters because many agent failures aren't execution failures. The code ran correctly. The API call succeeded. The agent just made a bad decision — took the wrong branch, misunderstood the goal, acted on stale information. Without reasoning visibility, you can't distinguish a bad decision from a system error, and you can't understand how to prevent the same failure next time.
## Why Standard APM Tools Miss the Mark
APM vendors have added "AI observability" features in response to demand. Most of them instrument the model API call — they capture the prompt, the response, the latency, the token count. That's useful. But it misses everything that happens before and after the model call.
Before the model call: what data did the agent retrieve? What tools did it invoke to gather context? What state is it operating on?
After the model call: what did the model's output actually cause? What commands ran? What systems changed? What effects persisted?
An APM tool that shows you a beautiful trace of your LLM calls but can't tell you what shell commands those calls triggered isn't giving you agent observability. It's giving you LLM API observability — which is a subset of the problem, not the whole thing.
| Observability layer | Standard APM | LLM-specific APM | Agent observability |
|---|---|---|---|
| Service calls / RPC | ✅ Yes | ✅ Yes | ✅ Yes |
| LLM API calls | ⚠️ Partial | ✅ Yes | ✅ Yes |
| Shell / command execution | ❌ No | ❌ No | ✅ Required |
| Cross-system session correlation | ⚠️ RPC only | ❌ No | ✅ Required |
| Agent reasoning / decision context | ❌ No | ⚠️ Prompt/response only | ✅ Required |
| Model version per action | ❌ No | ⚠️ Often missing | ✅ Required |
## What Proper Agent Observability Requires
### Structured command audit trails
Every command an agent executes needs to be captured at the execution layer — not by the agent itself, and not by inference from application logs. The audit trail should record: timestamp, agent session ID, the exact command or API call with parameters, the system it targeted, and the result (exit code, response status, output summary).
"Structured" matters here. Free-text logs are hard to query and easy to miss. A structured event with defined fields can be queried ("show me all commands this agent ran against the production database in the last hour"), aggregated, and alerted on. That's the difference between a log and an audit trail.
### Session context propagation
Agent actions across multiple systems need to be tied together by a session or task identifier that propagates to each system. This is the same idea as distributed trace IDs — but it needs to work across layers that traces don't reach.
In practice, this means the agent runtime stamps each action with a session ID, and that ID is written to every audit log — the command log, the API call log, the database query log. When you need to reconstruct a task, you can retrieve every action associated with that session ID across all systems.
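One way to implement the stamping described above, sketched with Python's `contextvars`: the runtime sets a session ID once, and every outbound call or audit event pulls it from context. The variable name and the `X-Agent-Session-Id` header are assumptions for illustration, not an established convention.

```python
import contextvars
import uuid

# Carries the current agent session ID through the runtime.
_session_id = contextvars.ContextVar("agent_session_id", default=None)

def start_session():
    """Generate a session ID at task start and put it in context."""
    sid = f"agent-{uuid.uuid4().hex[:12]}"
    _session_id.set(sid)
    return sid

def outbound_headers():
    """Headers stamped onto every HTTP call the agent makes."""
    return {"X-Agent-Session-Id": _session_id.get() or "unknown"}

def audit_fields():
    """Fields merged into every audit event: command log, API log, query log."""
    return {"session_id": _session_id.get()}

sid = start_session()
print(outbound_headers())
```

Using a context variable rather than a global means concurrent agent tasks in the same process each carry their own session ID.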
### Model version tracking
Model behavior changes between versions, and sometimes within the same version over time. If you have a production incident caused by agent behavior, you need to know which model version was running, with which system prompt, and with which tool configuration.
This sounds obvious, but it's routinely missing. Teams update model versions, change system prompts, or modify tool definitions without recording what changed when. That makes post-incident analysis — and regression testing — much harder than it needs to be.
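A small sketch of recording this at session start, assuming illustrative names throughout (including the model version string): hashing the system prompt and tool configuration keeps the record compact while still making any change between two sessions detectable.

```python
import hashlib
import json
import time

def snapshot_session_config(session_id, model_version, system_prompt, tool_config):
    """Record model version plus prompt/tool-config hashes at session start."""
    return {
        "session_id": session_id,
        "recorded_at": time.time(),
        "model_version": model_version,
        "system_prompt_sha256": hashlib.sha256(system_prompt.encode()).hexdigest(),
        # sort_keys makes the hash stable regardless of dict ordering
        "tool_config_sha256": hashlib.sha256(
            json.dumps(tool_config, sort_keys=True).encode()
        ).hexdigest(),
    }

rec = snapshot_session_config(
    "sess-123", "example-model-v2",
    "You are a deployment agent.", {"tools": ["shell", "db"]},
)
```

Stored alongside the session ID, this record answers "what exactly was running?" during post-incident analysis without archiving every full prompt.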
### Decision logging
When an agent makes a significant choice — which action to take, how to interpret ambiguous instructions, whether to proceed or ask for confirmation — that decision should be logged with enough context to reconstruct the reasoning.
Full decision logging at every step isn't practical — it's expensive and generates enormous volume. But logging at decision branch points (when the agent takes an action above a risk threshold, when it handles an ambiguous case, when it encounters an unexpected system state) gives you the visibility you need without overwhelming storage and review capacity.
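The threshold idea above can be sketched as a simple gate: only actions in a high-risk set trigger a decision record. The action-type names and the 500-character context bound are illustrative choices, not fixed values.

```python
import time

# Action types that cross the risk threshold; this set is illustrative.
HIGH_RISK_ACTIONS = {"db_write", "file_delete", "external_side_effect"}

def maybe_log_decision(action_type, rationale, context_summary, decision_log):
    """Log reasoning context only at high-risk branch points."""
    if action_type not in HIGH_RISK_ACTIONS:
        return False  # low-risk actions skip decision logging entirely
    decision_log.append({
        "timestamp": time.time(),
        "action_type": action_type,
        "rationale": rationale,                    # the model's stated reason
        "context_summary": context_summary[:500],  # bounded, not the full window
    })
    return True

decisions = []
maybe_log_decision("db_write", "Config row is stale; updating.",
                   "task: refresh staging", decisions)
maybe_log_decision("read_file", "", "", decisions)  # below threshold, not logged
```

The gate keeps decision-log volume proportional to risky actions rather than to every model call, which is what makes review capacity sustainable.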
## Practical Implementation Baseline: Five Controls
You don't need a perfect observability system before you deploy agents. But you do need a minimum baseline. These five controls are the floor:
- **Execution-layer command logging.** Log every shell command and direct API call at the runtime level, not the application level. The log should include the full command, the agent session ID, and the timestamp. This is the single highest-value control — without it, you have no ground truth for what the agent actually did.
- **Session ID propagation.** Generate a session ID for each agent task and pass it as a header, tag, or metadata field to every system the agent touches. It doesn't need to be perfect — even partial propagation dramatically reduces reconstruction time during an incident.
- **Model and configuration versioning.** Record the model version, system prompt hash, and tool configuration at the start of each agent session. Store this alongside the session ID so you can retrieve it when investigating a specific session.
- **High-risk decision logging.** Define a set of action types that require explicit logging of the agent's reasoning context — database writes, file deletions, external API calls with side effects, anything that's hard to reverse. When the agent takes one of these actions, log the prompt context and the model's stated rationale alongside the action record.
- **Queryable audit store.** The above controls only help if the data is queryable. Write audit events to a store you can search by session ID, time range, action type, and target system. A flat log file doesn't meet this bar. A structured log shipped to any decent log management tool does.
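To make the last control concrete, here is a minimal sketch using in-memory SQLite as a stand-in for a real log store. The schema and sample events are illustrative; the point is that the incident question "what did this session do against this system?" becomes an indexed query.

```python
import sqlite3
import time

# In-memory SQLite stands in for a real audit store; schema is illustrative.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE audit_events (
    ts REAL, session_id TEXT, action_type TEXT, target_system TEXT, detail TEXT
)""")
db.execute("CREATE INDEX idx_session ON audit_events(session_id)")

def record(session_id, action_type, target_system, detail):
    """Append one structured audit event."""
    db.execute("INSERT INTO audit_events VALUES (?, ?, ?, ?, ?)",
               (time.time(), session_id, action_type, target_system, detail))

record("sess-1", "shell", "ci-runner", "pytest -q")
record("sess-1", "db_write", "prod-db", "UPDATE config SET ttl = 60")
record("sess-2", "api_call", "cloud", "CreateInstance")

# "Show me everything session sess-1 did against the production database":
rows = db.execute(
    "SELECT action_type, detail FROM audit_events"
    " WHERE session_id = ? AND target_system = ?",
    ("sess-1", "prod-db"),
).fetchall()
```

Any store with indexed fields works the same way — the SQL here maps directly onto a filter query in a log management tool.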
## The Honest Limitations
Even with all of the above, agent observability has real limits. They're worth naming so you're not surprised by them.
**You can't fully reconstruct reasoning.** Decision logs capture the context at decision points, but they're not a complete record of everything the model considered. Large language models don't expose their computation — they produce outputs. You can log the outputs and the inputs, but the path between them remains opaque.

**Cross-system propagation has gaps.** Session ID propagation only works for systems you control or can configure. Third-party SaaS tools, external APIs, and systems you don't own won't carry your session ID. Those actions are still logged at the agent level — you know the agent called them — but you can't see what the downstream system did with the call.

**Volume is a real problem.** Active agents generate a lot of events. If you log too much, you can't find what matters. If you log too little, you miss it. The right calibration depends on your agents' risk profile and your team's capacity to review. Start with a conservative scope and expand as you learn your signal-to-noise ratio.

**Observability is necessary but not sufficient for safety.** Knowing what your agent did is not the same as preventing it from doing something harmful. Observability tells you what happened after the fact. Prevention requires approval gates, permission controls, and blast radius limits. Observability and control are complementary — don't mistake one for the other.
## See exactly what your agents are doing
Expacti captures every command your AI agents run — structured, queryable, correlated to session context and model version. Shell commands, API calls, file operations: all in one audit trail. Know what happened, when, and why.
See it in action · Join the waitlist