Developer Tools · SRE · DevOps · Incident Response · Ops Automation
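AI SRE Incident Response Runbook Generator
Automatically generates an incident response standard operating procedure (runbook) from your system architecture and alert details, including diagnostic steps, remediation commands, and escalation paths.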
7 views · 4/18/2026
You are an expert Site Reliability Engineer. I will describe a production incident and my system architecture. Generate a comprehensive incident response runbook.
System Context
- Architecture: [describe your services, databases, message queues, etc.]
- Monitoring: [Prometheus/Grafana/Datadog/etc.]
- Alert: [paste the alert or describe the symptom]
Generate:
- Triage Checklist — 5-8 immediate diagnostic steps with exact commands (kubectl, curl, SQL queries)
- Root Cause Decision Tree — A flowchart in text form: If X then check Y, if Z then likely cause is W
- Mitigation Actions — Ranked by speed: (a) quick hotfix, (b) rollback steps, (c) scaling/failover
- Communication Template — Status page update and Slack message for stakeholders
- Post-Incident Tasks — Follow-up items to prevent recurrence
Format each section clearly with copy-pasteable commands. Use realistic Linux/K8s/cloud CLI syntax. Be specific, not generic.
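The three bracketed fields under System Context have to be filled in before the prompt is sent. A minimal Python sketch of one way to do that is shown here. The function name `fill_prompt`, the `PLACEHOLDERS` mapping, and the incident details in the example are all assumptions made for illustration; only the placeholder strings themselves come from the prompt above.

```python
# Bracketed placeholders exactly as they appear in the prompt's System Context block.
PLACEHOLDERS = {
    "architecture": "[describe your services, databases, message queues, etc.]",
    "monitoring": "[Prometheus/Grafana/Datadog/etc.]",
    "alert": "[paste the alert or describe the symptom]",
}


def fill_prompt(template: str, architecture: str, monitoring: str, alert: str) -> str:
    """Replace each bracketed placeholder in `template` with the supplied text.

    Hypothetical helper: raises ValueError if a value is blank or if the
    template does not contain the expected placeholder.
    """
    values = {"architecture": architecture, "monitoring": monitoring, "alert": alert}
    filled = template
    for name, placeholder in PLACEHOLDERS.items():
        value = values[name].strip()
        if not value:
            raise ValueError(f"{name} must not be empty")
        if placeholder not in template:
            raise ValueError(f"placeholder for {name} not found in template")
        filled = filled.replace(placeholder, value)
    return filled


# Only the System Context block is reproduced here; in practice pass the full prompt text.
SYSTEM_CONTEXT = """System Context
- Architecture: [describe your services, databases, message queues, etc.]
- Monitoring: [Prometheus/Grafana/Datadog/etc.]
- Alert: [paste the alert or describe the symptom]"""

# Hypothetical incident details, invented purely for illustration.
example = fill_prompt(
    SYSTEM_CONTEXT,
    architecture="3 Go services on Kubernetes, PostgreSQL primary with one replica, Kafka",
    monitoring="Prometheus with Grafana dashboards",
    alert="p99 latency on checkout-api above 2s for 10 minutes",
)
print(example)
```

Rejecting blank fields up front is a deliberate choice in this sketch: a prompt sent with an unfilled placeholder tends to produce the generic output the last instruction asks the model to avoid.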