LLM Application Canary Release and Traffic Switching Plan Designer

You are a senior ML platform engineer specializing in LLM deployment and release management. Design a complete canary/gradual release strategy for my LLM-powered application.

Application context:

Application type: [chatbot/search/content_generation/code_assistant]
Current model: [MODEL_NAME]
New model to release: [NEW_MODEL]
Daily request volume: [NUMBER]
SLA requirements: p99 latency ≤ [LATENCY_P99] ms, uptime ≥ [UPTIME]%
Infrastructure: [cloud_provider/self-hosted]

Design the following:

Traffic Splitting Strategy
- Phased rollout plan (1% → 5% → 25% → 50% → 100%)
- User cohort selection criteria (random, geo, feature flags, user tier)
- Sticky session handling for consistent user experience
- A/B test group isolation
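As an illustration of the level of detail expected for sticky traffic splitting, a minimal Python sketch might look like the following. It assumes a stable user ID is available on every request; the names `bucket_for` and `route`, the salt value, and the labels "current" and "new" are hypothetical and only for illustration, not a required design.

```python
import hashlib

# Rollout phases from the phased plan above, as percentages of traffic.
ROLLOUT_PHASES = [1, 5, 25, 50, 100]


def bucket_for(user_id: str, salt: str = "llm-rollout") -> float:
    """Map a user ID to a stable bucket in [0, 100).

    The salt is a hypothetical per-experiment string; changing it reshuffles
    which users land in the canary group.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then scale it
    # to a bucket with 0.01% granularity.
    value = int.from_bytes(digest[:8], "big")
    return (value % 10_000) / 100.0


def route(user_id: str, rollout_percent: float) -> str:
    """Return "new" if the user falls inside the rollout percentage, else "current"."""
    return "new" if bucket_for(user_id) < rollout_percent else "current"


if __name__ == "__main__":
    for percent in ROLLOUT_PHASES:
        print(percent, route("user-42", percent))
```

Because the bucket depends only on the salt and the user ID, a user who is routed to the new model at one phase stays on it at every later phase, which gives sticky behavior without storing session state.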
Quality Gates Between Phases
- Automated evaluation metrics:
  - Response quality score (LLM-as-judge pipeline)
  - Latency regression thresholds
  - Error rate ceilings
  - Token cost comparison
  - User satisfaction proxy metrics (thumbs up/down, retry rate, session length)
- Statistical significance requirements before advancing (minimum sample size, significance level, and the test used)
- Automatic rollback triggers
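As an illustration of how a quality gate could combine these checks, here is a minimal Python sketch. It assumes a one-sided two-proportion z-test on error rates, which is only one of several reasonable tests; the function names, the default `alpha` and `min_samples`, and the "advance"/"hold"/"rollback" labels are hypothetical placeholders rather than recommended values.

```python
from math import sqrt
from statistics import NormalDist


def error_rate_regressed(base_errors: int, base_total: int,
                         new_errors: int, new_total: int,
                         alpha: float = 0.05) -> bool:
    """One-sided two-proportion z-test: is the new model's error rate higher?"""
    p_base = base_errors / base_total
    p_new = new_errors / new_total
    pooled = (base_errors + new_errors) / (base_total + new_total)
    se = sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / new_total))
    if se == 0:
        # Both arms have identical all-success or all-failure outcomes,
        # so there is no evidence of a difference.
        return False
    z = (p_new - p_base) / se
    p_value = 1 - NormalDist().cdf(z)
    return p_value < alpha


def gate_decision(base_errors: int, base_total: int,
                  new_errors: int, new_total: int,
                  new_p99_ms: float, sla_p99_ms: float,
                  min_samples: int = 1000) -> str:
    """Return "rollback", "hold", or "advance" for the current phase."""
    if new_p99_ms > sla_p99_ms:
        return "rollback"  # latency regression beyond the SLA ceiling
    if error_rate_regressed(base_errors, base_total, new_errors, new_total):
        return "rollback"  # statistically significant error-rate increase
    if new_total < min_samples:
        return "hold"      # not enough canary traffic to decide yet
    return "advance"


if __name__ == "__main__":
    print(gate_decision(100, 10_000, 100, 10_000, new_p99_ms=800, sla_p99_ms=1000))
```

The same structure extends to the other metrics in the list (quality score, token cost, satisfaction proxies) by adding one check per metric before the final "advance".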
Monitoring Dashboard
- Real-time metrics to track (with Grafana/Datadog query examples)
- Alerting rules for each rollout phase
- Comparison views (old vs new model)
Rollback Playbook
- Instant rollback procedure (< 30 seconds)
- Partial rollback scenarios
- Data handling for affected requests
- Post-mortem template
Implementation
- Architecture diagram (load balancer → router → model endpoints)
- Feature flag configuration (LaunchDarkly/Unleash/custom)
- Kubernetes manifests or serverless config for running the current and new model side by side (canary or blue-green)
- CI/CD pipeline stages
Cost Analysis
- Parallel running cost estimate
- Break-even analysis for model migration
- Resource scaling recommendations

Provide concrete, copy-pasteable configurations and code snippets.