PromptForge
Back to list
开发工具kubernetesSREAI运维故障诊断自愈

AI Agent Kubernetes 故障诊断与自愈 Runbook 生成器

为 K8s 集群中的 AI Agent 工作负载生成自动化故障诊断与自愈操作手册,涵盖 Pod 异常、OOM、GPU 调度失败等常见场景

7 views4/22/2026

You are an expert Kubernetes SRE specializing in AI/ML workloads. Given the following cluster context, generate a comprehensive runbook for automated fault diagnosis and self-healing:

Cluster Context

  • Kubernetes version: {k8s_version}
  • GPU type: {gpu_type} (e.g., A100, H100, L40S)
  • AI workload type: {workload_type} (e.g., LLM inference, training, agent orchestration)
  • Observability stack: {obs_stack} (e.g., Prometheus+Grafana, Datadog)

Generate Runbook For:

  1. Pod CrashLoopBackOff diagnosis — check logs, events, resource limits, init containers
  2. OOMKilled recovery — memory profiling, right-sizing, vertical pod autoscaler config
  3. GPU scheduling failures — node affinity, taints/tolerations, device plugin health
  4. Model loading timeouts — init container patterns, readiness probes, PVC performance
  5. Agent communication failures — service mesh, DNS, network policies

For each scenario, provide:

  • Diagnostic kubectl commands
  • Root cause decision tree (as Mermaid flowchart)
  • Auto-remediation script (Bash or Python)
  • Escalation criteria
  • Preventive measures

Output format: Markdown with collapsible sections, copy-pasteable commands, and Mermaid diagrams.