Production-Grade AI Agent Observability Dashboard Designer

You are an observability engineer specializing in AI agent systems. Design a comprehensive monitoring and observability dashboard for production AI agents.

Agent System Overview

Number of agents: [single / multi-agent orchestration]
LLM providers: [OpenAI, Anthropic, Google, local models]
Tools/integrations: [list tools the agents use]
Traffic pattern: [request volume, peak hours]

Design the following dashboards:

1. Agent Performance Dashboard

Latency breakdown: End-to-end latency, LLM inference time, tool execution time, overhead
Success/failure rates: By agent, by tool, by model
Token usage: Input/output tokens per request, context window utilization
Concurrency: Active sessions, queued requests, rate limit hits
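
As an illustration of the latency breakdown and context window utilization metrics above, here is a minimal Python sketch. The `RequestTrace` record and its field names are hypothetical, overhead is assumed to be whatever end-to-end time is left after LLM inference and tool execution, and utilization is assumed to mean input tokens divided by the model's context window size.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request record; field names are illustrative only."""
    total_ms: float       # end-to-end latency
    llm_ms: float         # time spent in LLM inference
    tool_ms: float        # time spent executing tools
    input_tokens: int
    output_tokens: int
    context_window: int   # model's maximum context size, in tokens


def latency_breakdown(trace: RequestTrace) -> dict:
    """Split end-to-end latency into LLM, tool, and overhead components."""
    # Assumption: overhead is the remainder, clamped at zero in case of clock skew.
    overhead = max(trace.total_ms - trace.llm_ms - trace.tool_ms, 0.0)
    return {
        "end_to_end_ms": trace.total_ms,
        "llm_ms": trace.llm_ms,
        "tool_ms": trace.tool_ms,
        "overhead_ms": overhead,
    }


def context_utilization(trace: RequestTrace) -> float:
    """Fraction of the context window consumed by the input tokens."""
    if trace.context_window <= 0:
        return 0.0
    return trace.input_tokens / trace.context_window


trace = RequestTrace(1200.0, 800.0, 300.0, 6000, 500, 8000)
breakdown = latency_breakdown(trace)
utilization = context_utilization(trace)
```

If LLM calls and tool calls run in parallel, the simple subtraction overstates their combined share, so a real system would likely derive these numbers from span-level traces instead.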

2. Cost Analytics Dashboard

Per-request cost: Broken down by model, token type (input/output/cached)
Spend trends: Daily/weekly/monthly trends with forecasting
Cost per user action: Map business outcomes to LLM spend
Waste detection: Identify redundant calls, oversized contexts, unnecessary retries
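
The per-request cost breakdown could be computed along these lines. This is a sketch under stated assumptions: the model name `example-model` and every price in `PRICES_PER_MILLION` are made-up placeholders, not real provider rates, and cached tokens are assumed to be billed separately from regular input tokens.

```python
# Hypothetical price table in USD per one million tokens; not real provider pricing.
PRICES_PER_MILLION = {
    "example-model": {"input": 3.0, "output": 15.0, "cached": 0.3},
}


def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> dict:
    """Return the cost of one request, broken down by token type."""
    prices = PRICES_PER_MILLION[model]
    counts = {
        "input": input_tokens,
        "output": output_tokens,
        "cached": cached_tokens,
    }
    # Cost per token type = token count * price per million / 1,000,000.
    breakdown = {
        kind: counts[kind] * prices[kind] / 1_000_000 for kind in counts
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


cost = request_cost("example-model", input_tokens=2000, output_tokens=500,
                    cached_tokens=1000)
```

Summing these per-request records by user action or by day would feed the cost-per-action and trend views.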

3. Quality and Safety Dashboard

Response quality scores: Coherence, relevance, factual accuracy (via LLM-as-judge)
Guardrail trigger rates: Content filtering, PII detection, jailbreak attempts
Tool call accuracy: Expected vs actual tool usage patterns
Hallucination detection: Confidence scores, citation verification rates
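
Guardrail trigger rates can be sketched as a simple count per category divided by total requests. The category labels (`pii`, `jailbreak`, `content_filter`) and the flat event-list input are assumptions made for illustration.

```python
from collections import Counter


def guardrail_trigger_rates(events: list, total_requests: int) -> dict:
    """Return the share of requests that triggered each guardrail category.

    `events` holds one category label per guardrail trigger (hypothetical format).
    """
    if total_requests <= 0:
        return {}
    counts = Counter(events)
    return {category: n / total_requests for category, n in counts.items()}


rates = guardrail_trigger_rates(
    ["pii", "jailbreak", "pii", "content_filter"], total_requests=100
)
```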

4. Alert Rules

Define specific alert conditions with thresholds:

P99 latency > X ms
Error rate > Y% over Z minutes
Cost anomaly (> 2 standard deviations from the rolling average)
Guardrail bypass detected
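
The cost anomaly condition can be illustrated with a small z-score check in Python. This is only a sketch: the function name, the use of the sample standard deviation, and the choice to treat a short history as "no anomaly" are assumptions, and a production rule would more likely be expressed in the monitoring system's own query language.

```python
import statistics


def is_cost_anomaly(history: list, current: float, threshold: float = 2.0) -> bool:
    """Flag `current` if it lies more than `threshold` std devs from the rolling mean.

    `history` is the rolling window of recent cost values (e.g. daily spend).
    """
    if len(history) < 2:
        # Assumption: not enough data to estimate spread, so do not alert.
        return False
    mean = statistics.mean(history)
    std = statistics.stdev(history)  # sample standard deviation
    if std == 0:
        # A perfectly flat history: any deviation at all counts as anomalous.
        return current != mean
    return abs(current - mean) / std > threshold


daily_cost = [10, 12, 11, 9, 10, 12, 11]
spike = is_cost_anomaly(daily_cost, 20)
normal = is_cost_anomaly(daily_cost, 11)
```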

Output: Dashboard wireframes, metrics definitions, alert rules in Prometheus/Grafana format, recommended tech stack.