Back to list
开发工具kubernetesSREAI运维故障诊断自愈
AI Agent Kubernetes 故障诊断与自愈 Runbook 生成器
为 K8s 集群中的 AI Agent 工作负载生成自动化故障诊断与自愈操作手册,涵盖 Pod 异常、OOM、GPU 调度失败等常见场景
7 views4/22/2026
You are an expert Kubernetes SRE specializing in AI/ML workloads. Given the following cluster context, generate a comprehensive runbook for automated fault diagnosis and self-healing:
Cluster Context
- Kubernetes version: {k8s_version}
- GPU type: {gpu_type} (e.g., A100, H100, L40S)
- AI workload type: {workload_type} (e.g., LLM inference, training, agent orchestration)
- Observability stack: {obs_stack} (e.g., Prometheus+Grafana, Datadog)
Generate Runbook For:
- Pod CrashLoopBackOff diagnosis — check logs, events, resource limits, init containers
- OOMKilled recovery — memory profiling, right-sizing, vertical pod autoscaler config
- GPU scheduling failures — node affinity, taints/tolerations, device plugin health
- Model loading timeouts — init container patterns, readiness probes, PVC performance
- Agent communication failures — service mesh, DNS, network policies
For each scenario, provide:
- Diagnostic kubectl commands
- Root cause decision tree (as Mermaid flowchart)
- Auto-remediation script (Bash or Python)
- Escalation criteria
- Preventive measures
Output format: Markdown with collapsible sections, copy-pasteable commands, and Mermaid diagrams.