Tags: DEVELOPMENT, MLOps, deployment, canary, LLM-inference, SRE, runbook
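LLM Inference Service Canary Release and Traffic Switching Runbook Generator
Given information about a model-serving architecture, this prompt generates a canary release runbook covering canary deployment, traffic switching, rollback strategy, monitoring, and alerting.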
7 views, 4/18/2026
You are a senior MLOps/SRE engineer specializing in LLM inference service deployments.
Given the following service architecture:
- Current model: {current_model}
- New model: {new_model}
- Infrastructure: {infra_details}
- Traffic volume: {qps} requests/second
- SLA requirements: {sla}
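The placeholders above are meant to be filled in before the prompt is sent. A minimal sketch of one way to do that with Python's `str.format`, assuming the five placeholder names shown in the template; the helper name `render_architecture` and the example values are hypothetical:

```python
# Architecture block of the prompt, copied from the template above.
TEMPLATE = (
    "- Current model: {current_model}\n"
    "- New model: {new_model}\n"
    "- Infrastructure: {infra_details}\n"
    "- Traffic volume: {qps} requests/second\n"
    "- SLA requirements: {sla}\n"
)

REQUIRED_KEYS = ["current_model", "new_model", "infra_details", "qps", "sla"]


def render_architecture(values: dict) -> str:
    """Fill the template, failing loudly if any placeholder value is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in values]
    if missing:
        raise KeyError(f"missing placeholder values: {missing}")
    return TEMPLATE.format(**values)


if __name__ == "__main__":
    # Example values are illustrative only.
    print(render_architecture({
        "current_model": "model-v1",
        "new_model": "model-v2",
        "infra_details": "Kubernetes, 8 GPU nodes",
        "qps": 120,
        "sla": "p95 TTFT under 500 ms",
    }))
```

Checking for missing keys up front avoids sending a prompt that still contains a literal `{qps}` or similar.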
Generate a production-ready deployment runbook:
Phase 1: Pre-deployment Checklist
- Model weights downloaded and verified (sha256)
- Benchmark results on staging (throughput, latency, quality)
- A/B test evaluation criteria defined
- Rollback procedure documented and tested
- Monitoring dashboards configured
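As an illustration of the first checklist item, the sketch below verifies a weights file against an expected SHA-256 digest using only the standard library. The function names and the idea of passing the expected digest as a hex string are assumptions; a real runbook would take the digest from wherever the model publisher records it.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large weight files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path: str, expected_hex: str) -> bool:
    """Return True only if the file's digest matches the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A deployment script could refuse to proceed to Phase 2 when `verify_weights` returns `False`.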
Phase 2: Canary Deployment (1-5% traffic)
- Deployment commands (Helm/kubectl/docker)
- Health check endpoints and expected responses
- Key metrics to monitor (TTFT, ITL, throughput, error rate, GPU utilization)
- Duration: minimum observation window before the go/no-go decision
- Go/No-go criteria with specific thresholds
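To make "specific thresholds" concrete, here is a minimal sketch of a go/no-go check that compares canary metrics with the current model's baseline. The threshold values (10% latency regression, 1% error rate, 95% of baseline throughput) and all names are hypothetical placeholders, not recommendations; they would come from the SLA supplied in the prompt.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    ttft_p95_ms: float    # time to first token, 95th percentile
    itl_p95_ms: float     # inter-token latency, 95th percentile
    error_rate: float     # fraction of failed requests, 0.0 to 1.0
    tokens_per_s: float   # aggregate throughput


# Hypothetical thresholds; tune these against the actual SLA.
MAX_LATENCY_REGRESSION = 1.10   # canary may be at most 10% slower
MAX_ERROR_RATE = 0.01           # at most 1% failed requests
MIN_THROUGHPUT_RATIO = 0.95     # at least 95% of baseline throughput


def go_no_go(baseline: Metrics, canary: Metrics) -> tuple[bool, list[str]]:
    """Return (go, failures); go is True only when no threshold is violated."""
    failures = []
    if canary.ttft_p95_ms > baseline.ttft_p95_ms * MAX_LATENCY_REGRESSION:
        failures.append("TTFT p95 regression")
    if canary.itl_p95_ms > baseline.itl_p95_ms * MAX_LATENCY_REGRESSION:
        failures.append("ITL p95 regression")
    if canary.error_rate > MAX_ERROR_RATE:
        failures.append("error rate above limit")
    if canary.tokens_per_s < baseline.tokens_per_s * MIN_THROUGHPUT_RATIO:
        failures.append("throughput below baseline")
    return (not failures, failures)
```

Returning the list of failed checks, rather than only a boolean, gives the on-call engineer a reason to record when the rollout is halted.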
Phase 3: Progressive Rollout
- Traffic split schedule: 5% -> 25% -> 50% -> 100%
- Minimum soak time per stage
- Automated quality comparison (side-by-side eval)
- Cost comparison (tokens/second/GPU)
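The traffic split schedule and the cost metric can be sketched as follows. The stage percentages come from the prompt; the uniform soak time, the function names, and the example numbers are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Stage percentages taken from the schedule in the prompt.
STAGES = [5, 25, 50, 100]


def build_schedule(start: datetime, soak_minutes: int, stages=STAGES):
    """Return (traffic_percent, start_time) pairs, one soak period apart."""
    plan = []
    current = start
    for percent in stages:
        plan.append((percent, current))
        current += timedelta(minutes=soak_minutes)
    return plan


def tokens_per_second_per_gpu(total_tokens_per_s: float, gpu_count: int) -> float:
    """Cost-efficiency metric used to compare the old and new model."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    return total_tokens_per_s / gpu_count


if __name__ == "__main__":
    for percent, when in build_schedule(datetime(2026, 1, 1, 9, 0), soak_minutes=60):
        print(f"{when:%H:%M}  shift to {percent}% of traffic")
```

In practice each stage transition would also be gated on the go/no-go criteria, so the timestamps are the earliest possible promotion times rather than fixed deadlines.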
Phase 4: Full Rollout and Cleanup
- Old model decommission steps
- Cache warming strategy
- Documentation updates
Emergency Rollback Procedure
- Single-command rollback
- Traffic drain procedure
- Post-mortem template
Monitoring and Alerting
- Prometheus/Grafana query templates for key metrics
- PagerDuty/Slack alert rules
- Anomaly detection thresholds
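As a rough idea of what the query templates might look like, the sketch below builds two PromQL strings in Python. The metric names `llm_ttft_seconds_bucket` and `llm_requests_total` and the `deployment` and `status` labels are hypothetical; the actual names depend on what the inference server exports.

```python
def ttft_p95_query(deployment: str, window: str = "5m") -> str:
    """PromQL for p95 time to first token, from a (hypothetical) histogram metric."""
    return (
        "histogram_quantile(0.95, sum by (le) ("
        f'rate(llm_ttft_seconds_bucket{{deployment="{deployment}"}}[{window}])))'
    )


def error_rate_query(deployment: str, window: str = "5m") -> str:
    """PromQL for the share of 5xx responses, from a (hypothetical) counter metric."""
    selector = f'deployment="{deployment}"'
    errors = f'sum(rate(llm_requests_total{{{selector},status=~"5.."}}[{window}]))'
    total = f'sum(rate(llm_requests_total{{{selector}}}[{window}]))'
    return f"{errors} / {total}"


if __name__ == "__main__":
    print(ttft_p95_query("canary"))
    print(error_rate_query("canary"))
```

Generating the canary and stable queries from the same function keeps the two dashboards directly comparable.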
Output as executable markdown with copy-pasteable commands.