Observability

Where the system shows its work. Three panels: the live agent event bus log + topology, the OTel trace explorer (NAT semantic spans), and the LLM utilization summary (hops / tokens / fallback chain).

What it is · how it works · why it matters

[ what ]

A live view of the agent system: bus event log, OTel trace explorer (NAT spans), LLM utilization (hops / tokens / fallback chain), audit decisions.

[ how ]

Every meaningful operation emits an OTel span via NAT to data/traces/spans.jsonl. The "▶ Trigger end-to-end" button fires Scheduler.tick.eod and shows roughly 30 events across 25 agents in about 5 seconds. Phoenix export is optional, via docker-compose --profile phoenix.

[ why ]

Auditable decision traces are a common requirement for regulated and fiduciary workflows. Every decision can be opened and inspected, and every order traces back to the spans that produced it.

Overview

NVTrader instruments every meaningful operation with OTel spans through NeMo Agent Toolkit (NAT). Spans land in data/traces/spans.jsonl (10 MB rotation). The Observability page renders three views over that data.
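Because the span store is plain JSONL, it can be read without the UI. The sketch below is illustrative only: it assumes one JSON object per line, assumes spans are appended in emission order, and borrows the field names (trace_id, span_id, name, elapsed_ms) from the trace explorer columns rather than from a confirmed schema.

```python
import json
from typing import Iterable, Iterator, List


def parse_spans(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one span dict per parseable line; skip blanks and broken lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            span = json.loads(line)
        except json.JSONDecodeError:
            # A line can be half-written while the file is being appended to.
            continue
        if isinstance(span, dict):
            yield span


def tail_spans(lines: Iterable[str], limit: int = 100) -> List[dict]:
    """Return the last `limit` spans, newest first (assumes append order)."""
    if limit <= 0:
        return []
    spans = list(parse_spans(lines))
    return spans[-limit:][::-1]


# In-memory sample; field names are assumptions, not a confirmed schema.
sample = [
    '{"trace_id": "t1", "span_id": "s1", "name": "bus.publish", "elapsed_ms": 2.0}',
    "",
    '{"trace_id": "t1", "span_id": "s2", "name": "cufolio.solve", "elapsed_ms": 840.0}',
    '{"trace_id": "t1", "span_id"',  # truncated line, skipped
]
recent = tail_spans(sample, limit=2)
print([s["name"] for s in recent])
```

Against the real file, pass an open file handle instead of the sample list, for example `with open("data/traces/spans.jsonl", encoding="utf-8") as fh: tail_spans(fh)`. Rotated files, if kept, would need to be read separately.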

Agent event bus

Top of the page. Click ▶ Trigger end-to-end to fire Scheduler.tick.eod onto the A2A bus. The event log streams as the cascade plays out — roughly 30 events across 25 agents in 5 seconds.

The topology card to the right shows the agent registry. Each agent lists its subscribes and emits. Click any agent to filter the event log to events it touched.
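The agent filter can be thought of as "keep every event this agent emitted or received". A minimal sketch, assuming a hypothetical event shape (topic, emitted_by, delivered_to); the agent names and all topics other than Scheduler.tick.eod are invented for illustration and the real bus payload may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BusEvent:
    # Hypothetical event shape; not taken from the actual bus schema.
    topic: str
    emitted_by: str
    delivered_to: List[str] = field(default_factory=list)


def touched_by(events: List[BusEvent], agent: str) -> List[BusEvent]:
    """Keep events the agent either emitted or received, in log order."""
    return [e for e in events if e.emitted_by == agent or agent in e.delivered_to]


# Invented agents and topics, used only to exercise the filter.
events = [
    BusEvent("Scheduler.tick.eod", "scheduler", ["signals", "risk"]),
    BusEvent("signals.ready", "signals", ["portfolio"]),
    BusEvent("portfolio.proposed", "portfolio", ["risk"]),
]
print([e.topic for e in touched_by(events, "risk")])
```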

See A2A event bus engine docs for the deep dive.

Trace explorer

Middle of the page. Lists OTel spans newest-first. Each row shows trace_id, span_id, span name (e.g. cufolio.solve, broker.alpaca.place_order, bus.publish), elapsed_ms, status, and the agent that emitted the span.

Click a span to open the detail drawer, which shows the full attribute dict, the parent trace, and child spans. Aggregations at the top show p50 / p95 / max latency per span name and throughput per minute.
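The per-name latency aggregations can be reproduced offline from the span records. The sketch below uses the nearest-rank percentile method, which is an assumption; the page may compute percentiles differently (for example with interpolation), so small differences from the UI are expected. The name and elapsed_ms keys follow the row columns described above.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List


def percentile(sorted_vals: List[float], pct: float) -> float:
    """Nearest-rank percentile over a non-empty, ascending list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def latency_stats(spans: Iterable[dict]) -> Dict[str, dict]:
    """Group spans by name and compute count, p50, p95, and max latency in ms."""
    by_name = defaultdict(list)
    for span in spans:
        by_name[span["name"]].append(float(span["elapsed_ms"]))
    stats = {}
    for name, vals in by_name.items():
        vals.sort()
        stats[name] = {
            "count": len(vals),
            "p50": percentile(vals, 50),
            "p95": percentile(vals, 95),
            "max": vals[-1],
        }
    return stats


# Invented latencies, used only to exercise the aggregation.
spans = [
    {"name": "cufolio.solve", "elapsed_ms": 300.0},
    {"name": "cufolio.solve", "elapsed_ms": 100.0},
    {"name": "cufolio.solve", "elapsed_ms": 400.0},
    {"name": "cufolio.solve", "elapsed_ms": 200.0},
    {"name": "bus.publish", "elapsed_ms": 2.0},
]
stats = latency_stats(spans)
print(stats["cufolio.solve"])
```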

LLM utilization

Bottom of the page. One row per recent chat turn:

Hops bar — how many tool-call hops the turn used vs the cap. Red if it hit the cap.
Model chain — which models were tried (e.g. kimi-k2.6 → nemotron-super-120b). Arrows highlight where the fallback fired.
Tokens in / out — input + output token counts.
Latency — wall-clock end-to-end.
Cost (est) — based on the provider's published rate card.
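The columns above can be derived from a handful of per-turn fields. This is a sketch under stated assumptions: the ChatTurn shape is hypothetical, the per-million-token rates are placeholder numbers rather than any provider's actual rate card, and the cost formula ignores the fact that tokens spent on a failed primary model may be billed at a different rate than the fallback.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChatTurn:
    # Hypothetical row shape; the real utilization record may differ.
    hops: int
    model_chain: List[str]
    tokens_in: int
    tokens_out: int


def hit_cap(turn: ChatTurn, max_hops: int) -> bool:
    """True when the turn used every allowed tool-call hop (the red bar)."""
    return turn.hops >= max_hops


def fallback_fired(turn: ChatTurn) -> bool:
    """True when more than one model was tried for the turn."""
    return len(turn.model_chain) > 1


def estimate_cost(turn: ChatTurn, usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Estimated USD cost from token counts and per-million-token rates."""
    return (turn.tokens_in * usd_per_m_in + turn.tokens_out * usd_per_m_out) / 1_000_000


turn = ChatTurn(
    hops=8,
    model_chain=["kimi-k2.6", "nemotron-super-120b"],
    tokens_in=12_000,
    tokens_out=800,
)
# 8 is a placeholder cap; 0.60 and 2.50 are placeholder rates, not real prices.
print(hit_cap(turn, max_hops=8), fallback_fired(turn))
print(f"{estimate_cost(turn, usd_per_m_in=0.60, usd_per_m_out=2.50):.4f} USD")
```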

What to look for

Signal	Likely cause	Action
cuFOLIO span > 1.5 s	`n_scenarios` too high or CPU fallback	Check the `device` attribute; lower `n_scenarios`.
Broker span errors spike	Venue rate limit or PDT	See the Orders page rationale.
Bus event count drops to 0	Scheduler skipped or process crashed	Check the scheduler page and the last server restart.
LLM hops hit cap on every turn	Kimi confused by tool result format	Simplify the tool's JSON return, or raise `CHAT_MAX_HOPS`.
Fallback model fires often	Primary throwing 429 or degenerating	Check the NVIDIA Build rate-limit dashboard; lower `frequency_penalty`.
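The first row of the table can be checked directly against span records. A minimal sketch, assuming the `device` attribute sits under an "attributes" key on each span and that the solve span is named cufolio.solve as in the trace explorer example; the device values shown are invented.

```python
from typing import Iterable, List, Tuple

CUFOLIO_SLOW_MS = 1500.0  # the 1.5 s threshold from the table


def slow_cufolio_spans(
    spans: Iterable[dict], threshold_ms: float = CUFOLIO_SLOW_MS
) -> List[Tuple[str, float, str]]:
    """Return (span_id, elapsed_ms, device) for cuFOLIO solves over the threshold."""
    flagged = []
    for span in spans:
        if span.get("name") != "cufolio.solve":
            continue
        elapsed = float(span.get("elapsed_ms", 0.0))
        if elapsed <= threshold_ms:
            continue
        attrs = span.get("attributes") or {}  # assumed location of `device`
        flagged.append((span.get("span_id"), elapsed, attrs.get("device", "unknown")))
    return flagged


# Invented spans; "cpu" and "cuda:0" are illustrative device values.
spans = [
    {"span_id": "a", "name": "cufolio.solve", "elapsed_ms": 2100.0, "attributes": {"device": "cpu"}},
    {"span_id": "b", "name": "cufolio.solve", "elapsed_ms": 310.0, "attributes": {"device": "cuda:0"}},
    {"span_id": "c", "name": "bus.publish", "elapsed_ms": 1800.0},
]
flagged = slow_cufolio_spans(spans)
print(flagged)
```

A flagged span whose device reads as CPU points at the fallback cause; a flagged span on GPU points at the scenario count instead.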

Sending spans to your own collector

NAT writes JSONL by default. To ship to Phoenix / Tempo / Jaeger / DataDog, set OTEL_EXPORTER_OTLP_ENDPOINT in .env and start the stack with docker-compose --profile phoenix up. The collector mirrors spans to both the JSONL file and the OTLP endpoint.

REST surface

Verb	Path	Purpose
GET	`/api/observability/traces?limit=100`	Tail spans.
GET	`/api/observability/stats`	Latency aggs per span name.
GET	`/api/observability/llm`	LLM utilization rows.
GET	`/api/bus/events?limit=200`	Bus event ring buffer.
GET	`/api/bus/agents`	Registry topology.
POST	`/api/bus/trigger`	Fire an event onto the bus.
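The endpoints above can be called from any HTTP client. The sketch below builds the requests with the standard library only. The base URL, the JSON request body for the trigger endpoint, and the assumption that responses are JSON are all hypothetical; only the paths and query parameters come from the table.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # assumption: adjust to where the server listens


def get_request(path: str, **params) -> Request:
    """Build a GET request for one of the read-only endpoints."""
    url = BASE_URL + path
    if params:
        url += "?" + urlencode(params)
    return Request(url, method="GET")


def trigger_request(event: str) -> Request:
    """Build the POST that fires an event onto the bus.

    The {"event": ...} body is a guess at the payload shape, not a documented contract.
    """
    body = json.dumps({"event": event}).encode("utf-8")
    return Request(
        BASE_URL + "/api/bus/trigger",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: Request, timeout: float = 5.0):
    """Send a request and decode the response, assuming a JSON body."""
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Requests are only constructed here; call send(...) against a running server.
traces_req = get_request("/api/observability/traces", limit=100)
events_req = get_request("/api/bus/events", limit=200)
fire_req = trigger_request("Scheduler.tick.eod")
print(traces_req.full_url)
print(fire_req.get_method(), fire_req.full_url)
```

A typical session would call send(fire_req), wait a few seconds for the cascade, then call send(events_req) and send(traces_req) to inspect what it produced.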