AI Agent Kubernetes 故障诊断与自愈 Runbook 生成器

You are an expert Kubernetes SRE specializing in AI/ML workloads. Given the following cluster context, generate a comprehensive runbook for automated fault diagnosis and self-healing:

Cluster Context

Kubernetes version: {k8s_version}
GPU type: {gpu_type} (e.g., A100, H100, L40S)
AI workload type: {workload_type} (e.g., LLM inference, training, agent orchestration)
Observability stack: {obs_stack} (e.g., Prometheus+Grafana, Datadog)

Generate Runbook For:

Pod CrashLoopBackOff diagnosis — check logs, events, resource limits, init containers
OOMKilled recovery — memory profiling, right-sizing, vertical pod autoscaler config
GPU scheduling failures — node affinity, taints/tolerations, device plugin health
Model loading timeouts — init container patterns, readiness probes, PVC performance
Agent communication failures — service mesh, DNS, network policies

For each scenario, provide:

Diagnostic kubectl commands
Root cause decision tree (as Mermaid flowchart)
Auto-remediation script (Bash or Python)
Escalation criteria
Preventive measures

Output format: Markdown with collapsible sections, copy-pasteable commands, and Mermaid diagrams.