DEVELOPMENT: MLOps, deployment, canary, LLM-inference, SRE, runbook

LLM Inference Service Canary Release and Traffic Switching Runbook Generator

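Given information about a model-serving architecture, automatically generates a canary release runbook covering canary deployment, traffic switching, rollback strategy, monitoring and alerting, and more.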

4/18/2026

You are a senior MLOps/SRE engineer specializing in LLM inference service deployments.

Given the following service architecture:

Current model: {current_model}
New model: {new_model}
Infrastructure: {infra_details}
Traffic volume: {qps} requests/second
SLA requirements: {sla}

Generate a production-ready deployment runbook:

Phase 1: Pre-deployment Checklist

Model weights downloaded and verified (sha256)
Benchmark results on staging (throughput, latency, quality)
A/B test evaluation criteria defined
Rollback procedure documented and tested
Monitoring dashboards configured
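
As an illustration of the sha256 verification item in the checklist above, here is a minimal sketch in Python. The function names `sha256_of_file` and `verify_weights` are hypothetical, and the expected digest is assumed to come from whoever published the weights.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with the published one (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Chunked reading matters here because model weight files are often too large to load into memory at once.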

Phase 2: Canary Deployment (1-5% traffic)

Deployment commands (Helm/kubectl/docker)
Health check endpoints and expected responses
Key metrics to monitor (TTFT, ITL, throughput, error rate, GPU utilization)
Duration: minimum observation window
Go/No-go criteria with specific thresholds
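
The go/no-go item above could be expressed as an explicit check. The sketch below is one possible shape, not a prescribed one: the `CanaryMetrics` fields and both threshold constants are hypothetical placeholders, and real values should be derived from the SLA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryMetrics:
    ttft_p95_ms: float   # time to first token, 95th percentile
    itl_p95_ms: float    # inter-token latency, 95th percentile
    error_rate: float    # fraction of failed requests, 0.0 to 1.0


# Hypothetical thresholds for illustration only.
MAX_LATENCY_REGRESSION = 1.10   # canary may be at most 10% slower than baseline
MAX_ERROR_RATE = 0.01           # absolute ceiling on the canary error rate


def go_no_go(baseline: CanaryMetrics, canary: CanaryMetrics) -> list[str]:
    """Return the violated criteria; an empty list means 'go'."""
    violations = []
    if canary.ttft_p95_ms > baseline.ttft_p95_ms * MAX_LATENCY_REGRESSION:
        violations.append("TTFT p95 regression")
    if canary.itl_p95_ms > baseline.itl_p95_ms * MAX_LATENCY_REGRESSION:
        violations.append("ITL p95 regression")
    if canary.error_rate > MAX_ERROR_RATE:
        violations.append("error rate above ceiling")
    return violations
```

Returning the list of violations rather than a bare boolean keeps the reason for a no-go visible in logs and alerts.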

Phase 3: Progressive Rollout

Traffic split schedule: 5% -> 25% -> 50% -> 100%
Minimum soak time per stage
Automated quality comparison (side-by-side eval)
Cost comparison (tokens/second/GPU)
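
For the cost comparison item above, tokens/second/GPU can be computed from a token count over an observation window. This is a rough sketch under the assumption that total generated tokens, window length, and GPU count are available from monitoring; the function names are hypothetical.

```python
def tokens_per_second_per_gpu(total_tokens: int, window_seconds: float, gpu_count: int) -> float:
    """Normalize throughput by GPU count so differently sized deployments compare fairly."""
    if window_seconds <= 0 or gpu_count <= 0:
        raise ValueError("window_seconds and gpu_count must be positive")
    return total_tokens / window_seconds / gpu_count


def relative_change(old: float, new: float) -> float:
    """Fractional change from old to new; positive means the new model is more efficient."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old
```

For example, 1,200,000 tokens generated in 60 seconds on 4 GPUs works out to 5,000 tokens/second/GPU.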

Phase 4: Full Rollout and Cleanup

Old model decommission steps
Cache warming strategy
Documentation updates

Emergency Rollback Procedure

Single-command rollback
Traffic drain procedure
Post-mortem template

Monitoring and Alerting

Prometheus/Grafana query templates for key metrics
PagerDuty/Slack alert rules
Anomaly detection thresholds
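
The query templates mentioned above might look like the following sketch, which builds PromQL strings in Python. `histogram_quantile`, `rate`, and `sum ... by (le)` are standard PromQL; the metric names used in the usage note and the `status` label are assumptions, since actual names depend on the serving stack's exporter.

```python
def p95_latency_query(histogram_metric: str, window: str = "5m", quantile: float = 0.95) -> str:
    """Build a PromQL quantile query over a Prometheus histogram metric."""
    return (
        f"histogram_quantile({quantile}, "
        f"sum(rate({histogram_metric}_bucket[{window}])) by (le))"
    )


def error_rate_query(counter_metric: str, window: str = "5m",
                     error_selector: str = 'status=~"5.."') -> str:
    """Build a PromQL ratio of error requests to all requests.

    error_selector is a label matcher; the 'status' label is an assumption.
    """
    return (
        f"sum(rate({counter_metric}{{{error_selector}}}[{window}])) "
        f"/ sum(rate({counter_metric}[{window}]))"
    )
```

Calling `p95_latency_query("llm_ttft_seconds")` with that hypothetical metric name yields `histogram_quantile(0.95, sum(rate(llm_ttft_seconds_bucket[5m])) by (le))`.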

Output as executable markdown with copy-pasteable commands.