Observability
Where the system shows its work. Three panels: the live agent event bus log + topology, the OTel trace explorer (NAT semantic spans), and the LLM utilization summary (hops / tokens / fallback chain).
What it is · how it works · why it matters
A live view of the agent system: bus event log, OTel trace explorer (NAT spans), LLM utilization (hops / tokens / fallback chain), audit decisions.
Every meaningful op emits an OTel span via NAT to data/traces/spans.jsonl. The "▶ Trigger end-to-end" button fires Scheduler.tick.eod and shows ~30 events across 25 agents in 5 seconds. Optional Phoenix export via docker-compose --profile phoenix.
Auditable decision traces are a common requirement for regulated and fiduciary workflows. Every decision is tappable; every order traces back to the spans that produced it. This is what makes the platform auditable.
Overview
NVTrader instruments every meaningful operation with OTel spans through NeMo Agent Toolkit (NAT). Spans land in data/traces/spans.jsonl (10 MB rotation). The Observability page renders three views over that data.
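The span file can also be read outside the UI. A minimal sketch, assuming each line of data/traces/spans.jsonl is one JSON object; the helper names parse_spans and load_spans are illustrative and not part of NVTrader or NAT.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def parse_spans(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per valid JSONL line; skip blank or malformed lines."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            span = json.loads(raw)
        except json.JSONDecodeError:
            # Likely a partially written line (e.g. the file is being appended to).
            continue
        if isinstance(span, dict):
            yield span


def load_spans(path: str = "data/traces/spans.jsonl") -> list:
    """Read every span from the file; return an empty list if it does not exist."""
    p = Path(path)
    if not p.exists():
        return []
    with p.open(encoding="utf-8") as fh:
        return list(parse_spans(fh))
```

Skipping malformed lines rather than raising keeps the reader usable while the file is still being written or has just rotated.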
Agent event bus
Top of the page. Click ▶ Trigger end-to-end to fire Scheduler.tick.eod onto the A2A bus. The event log streams as the cascade plays out — roughly 30 events across 25 agents in 5 seconds.
The topology card to the right shows the agent registry. Each agent lists its subscribes and emits. Click any agent to filter the event log to events it touched.
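The agent filter can be pictured as follows. This is a sketch under assumptions: the registry and event shapes (subscribes, emits, topic, source) and the agent names are hypothetical, and "touched" is read here as "emitted by the agent or on a topic it subscribes to".

```python
def events_touched_by(agent, registry, events):
    """Return events the agent emitted or would have received via a subscription."""
    subscribed = set(registry[agent]["subscribes"])
    return [
        e for e in events
        if e["source"] == agent or e["topic"] in subscribed
    ]


# Hypothetical two-agent registry and a short event log.
registry = {
    "scheduler": {"subscribes": [], "emits": ["Scheduler.tick.eod"]},
    "risk": {"subscribes": ["Scheduler.tick.eod"], "emits": ["risk.checked"]},
}
events = [
    {"topic": "Scheduler.tick.eod", "source": "scheduler"},
    {"topic": "risk.checked", "source": "risk"},
]
```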
See A2A event bus engine docs for the deep dive.
Trace explorer
Middle of the page. Lists OTel spans newest-first. Each row shows trace_id, span_id, span name (e.g. cufolio.solve · broker.alpaca.place_order · bus.publish), elapsed_ms, status, and the agent that emitted it.
Click a span to open the detail drawer — full attribute dict, parent trace, child spans. Aggregations at the top: p50 / p95 / max latency per span name; throughput per minute.
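The per-span-name aggregations can be reproduced offline. A minimal sketch, assuming each span dict carries name and elapsed_ms keys and using the nearest-rank percentile definition; the method the page itself uses may differ.

```python
from collections import defaultdict


def percentile(sorted_vals, q):
    """Nearest-rank percentile for integer q in 1..100 over a sorted, non-empty list."""
    rank = max(1, (q * len(sorted_vals) + 99) // 100)  # integer ceil(q * n / 100)
    return sorted_vals[rank - 1]


def latency_stats(spans):
    """Group spans by name and compute count, p50, p95, and max of elapsed_ms."""
    by_name = defaultdict(list)
    for span in spans:
        if "name" in span and "elapsed_ms" in span:
            by_name[span["name"]].append(float(span["elapsed_ms"]))
    stats = {}
    for name, vals in by_name.items():
        vals.sort()
        stats[name] = {
            "count": len(vals),
            "p50": percentile(vals, 50),
            "p95": percentile(vals, 95),
            "max": vals[-1],
        }
    return stats
```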
LLM utilization
Bottom of the page. One row per recent chat turn:
- Hops bar: how many tool-call hops the turn used vs the cap. Red if it hit the cap.
- Model chain: which models were tried (e.g. kimi-k2.6 → nemotron-super-120b). Arrows highlight where the fallback fired.
- Tokens in / out: input + output token counts.
- Latency: wall-clock end-to-end.
- Cost (est): based on the provider's published rate card.
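The columns above can be derived from a few per-turn fields. A sketch under assumptions: the ChatTurn fields, the per-million-token rates, and the rule that the last model in the chain served the turn are all illustrative and not taken from NVTrader.

```python
from dataclasses import dataclass


@dataclass
class ChatTurn:
    hops: int          # tool-call hops used by this turn
    max_hops: int      # the hop cap in force
    model_chain: list  # models tried, in order
    tokens_in: int
    tokens_out: int


def summarize(turn, usd_per_mtok_in, usd_per_mtok_out):
    """Derive hop-cap, fallback, and estimated-cost fields for one chat turn."""
    cost = (turn.tokens_in * usd_per_mtok_in
            + turn.tokens_out * usd_per_mtok_out) / 1_000_000
    return {
        "hit_cap": turn.hops >= turn.max_hops,
        "fallback_fired": len(turn.model_chain) > 1,
        "served_by": turn.model_chain[-1],  # assumption: last model tried answered
        "cost_est_usd": cost,
    }
```

The rates are placeholders to be filled from the provider's published rate card; the estimate is only as current as those numbers.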
What to look for
| signal | likely cause | action |
|---|---|---|
| cuFOLIO span > 1.5 s | n_scenarios too high or CPU fallback | check the span's device attribute; lower n_scenarios. |
| broker span errors spike | venue rate-limit or pattern day trader (PDT) restriction | see Orders page rationale. |
| Bus event count drops to 0 | scheduler skipped or process crashed | check scheduler page + last server restart. |
| LLM hops hit cap on every turn | Kimi confused by tool result format | simplify the tool's JSON return; or raise CHAT_MAX_HOPS. |
| Fallback model fires often | primary throwing 429 or degenerating | check NVIDIA Build rate-limit dashboard; lower frequency_penalty. |
Sending spans to your own collector
NAT writes JSONL by default. To ship spans to Phoenix, Tempo, Jaeger, or DataDog, set OTEL_EXPORTER_OTLP_ENDPOINT in .env to the collector's OTLP endpoint; for Phoenix, also bring up the optional profile with docker-compose --profile phoenix. The collector then mirrors spans to both the JSONL file and the OTLP endpoint.
REST surface
| Verb | Path | Purpose |
|---|---|---|
| GET | /api/observability/traces?limit=100 | Tail spans. |
| GET | /api/observability/stats | Latency aggs per span name. |
| GET | /api/observability/llm | LLM utilization rows. |
| GET | /api/bus/events?limit=200 | Bus event ring buffer. |
| GET | /api/bus/agents | Registry topology. |
| POST | /api/bus/trigger | Fire an event onto the bus. |
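The endpoints above can be called from a script with the standard library alone. A minimal sketch: BASE_URL and the JSON body sent to /api/bus/trigger are assumptions, since neither the server address nor the request schema is documented here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # assumption: adjust to where the server runs


def build_url(path, **params):
    """Join the base URL, a path from the table above, and optional query params."""
    query = urlencode(params)
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def get_json(path, **params):
    """GET an endpoint and decode the JSON response body."""
    with urlopen(build_url(path, **params), timeout=10) as resp:
        return json.load(resp)


def trigger(event):
    """POST an event name to the bus; the {"event": ...} body shape is hypothetical."""
    body = json.dumps({"event": event}).encode("utf-8")
    req = Request(
        build_url("/api/bus/trigger"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

For example, get_json("/api/observability/traces", limit=100) tails the latest spans, and trigger("Scheduler.tick.eod") would mirror the ▶ Trigger end-to-end button if the body shape matches the server's.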