Performance Optimization Plan for a Multimodal LLM Inference Service
Design an end-to-end performance optimization plan for a multimodal large-model inference service, covering model quantization, batching strategy, GPU memory management, and more.
You are a senior ML infrastructure engineer specializing in high-throughput LLM serving systems.
Design a comprehensive optimization plan for deploying an omni-modal model (text + vision + audio) inference service with the following constraints:
Current Setup:
- Model: 70B parameter omni-modal model (similar to GPT-4o architecture)
- Hardware: 4x NVIDIA H100 80GB GPUs, 512GB system RAM
- Target: 200 concurrent users, <2s time-to-first-token, 60 tokens/s throughput
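Before picking optimizations, it helps to sanity-check the memory budget implied by these constraints. The sketch below is back-of-envelope arithmetic only; the model geometry (80 layers, 8 KV heads via GQA, head dim 128, Llama-70B-like) is an assumption, since the brief only states "70B parameters".

```python
# Rough GPU memory budget for the stated 4x H100 (320 GB HBM) setup.
# Geometry below is an ASSUMPTION (Llama-70B-like: 80 layers, 8 KV
# heads via grouped-query attention, head dim 128).

PARAMS = 70e9
TOTAL_HBM_GB = 4 * 80  # 4x H100 80GB

def weights_gb(bytes_per_param: float) -> float:
    """Weight footprint in GB at a given precision."""
    return PARAMS * bytes_per_param / 1e9

fp16_gb = weights_gb(2.0)  # ~140 GB: fits, but leaves little room for KV cache
fp8_gb = weights_gb(1.0)   # ~70 GB
int4_gb = weights_gb(0.5)  # ~35 GB (AWQ/GPTQ 4-bit)

# Per-token KV cache at FP16: 2 tensors (K, V) x layers x kv_heads x head_dim x 2 bytes
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * 2  # 327,680 B, ~0.31 MB/token

# 200 concurrent users with ~4k tokens of context each:
kv_demand_gb = 200 * 4096 * kv_per_token / 1e9  # ~268 GB of KV demand
kv_budget_gb = TOTAL_HBM_GB - fp8_gb            # ~250 GB free after FP8 weights

print(f"weights fp16/fp8/int4: {fp16_gb:.0f}/{fp8_gb:.0f}/{int4_gb:.0f} GB")
print(f"KV demand {kv_demand_gb:.0f} GB vs budget {kv_budget_gb:.0f} GB")
```

Even with FP8 weights, full-context KV demand exceeds free HBM, which is why paged KV management and cache eviction appear below as first-class concerns rather than nice-to-haves.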
Optimize across these dimensions:
- Quantization Strategy: Compare AWQ vs. GPTQ vs. FP8 for this model class. Impact on quality per modality. Memory-savings vs. quality tradeoff analysis.
- Batching & Scheduling: Continuous batching implementation. Priority-queue design for mixed-modality requests. Variable-length image/audio input handling.
- KV Cache Management: PagedAttention configuration. Cache eviction policy for long conversations with media. Prefix caching for common system prompts.
- Tensor Parallelism: Optimal TP degree for the 4-GPU setup. Pipeline vs. tensor parallelism tradeoffs. NVLink topology-aware placement.
- Monitoring & Autoscaling: Key metrics (TTFT, tokens/s, queue depth, GPU utilization). Autoscaling triggers. Graceful degradation under load.
For each optimization, provide expected speedup, implementation complexity, risk assessment, and recommended order.
Output as a prioritized optimization roadmap with estimated timeline.