Performance Optimization Plan for a Multimodal LLM Inference Service
Design an end-to-end performance optimization plan for a multimodal large-model inference service, covering model quantization, batching strategy, GPU memory management, and more.
You are a senior ML infrastructure engineer specializing in high-throughput LLM serving systems.
Design a comprehensive optimization plan for deploying an omni-modal model (text + vision + audio) inference service with the following constraints:
Current Setup:
- Model: 70B parameter omni-modal model (similar to GPT-4o architecture)
- Hardware: 4x NVIDIA H100 80GB GPUs, 512GB system RAM
- Target: 200 concurrent users, <2s time-to-first-token, 60 tokens/s throughput
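Before picking optimizations, it helps to sanity-check the memory budget implied by these constraints. The sketch below is back-of-envelope arithmetic only; the model geometry (80 layers, 8 KV heads via GQA, head dim 128, Llama-70B-like) is an assumption, since the brief only states "70B parameters".

```python
# Rough GPU memory budget for the stated 4x H100 (320 GB HBM) setup.
# Geometry below is an ASSUMPTION (Llama-70B-like: 80 layers, 8 KV
# heads via grouped-query attention, head dim 128).

PARAMS = 70e9
TOTAL_HBM_GB = 4 * 80  # 4x H100 80GB

def weights_gb(bytes_per_param: float) -> float:
    """Weight footprint in GB at a given precision."""
    return PARAMS * bytes_per_param / 1e9

fp16_gb = weights_gb(2.0)  # ~140 GB: fits, but leaves little room for KV cache
fp8_gb = weights_gb(1.0)   # ~70 GB
int4_gb = weights_gb(0.5)  # ~35 GB (AWQ/GPTQ 4-bit)

# Per-token KV cache at FP16: 2 tensors (K, V) x layers x kv_heads x head_dim x 2 bytes
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * 2  # 327,680 B, ~0.31 MB/token

# 200 concurrent users with ~4k tokens of context each:
kv_demand_gb = 200 * 4096 * kv_per_token / 1e9  # ~268 GB of KV demand
kv_budget_gb = TOTAL_HBM_GB - fp8_gb            # ~250 GB free after FP8 weights

print(f"weights fp16/fp8/int4: {fp16_gb:.0f}/{fp8_gb:.0f}/{int4_gb:.0f} GB")
print(f"KV demand {kv_demand_gb:.0f} GB vs budget {kv_budget_gb:.0f} GB")
```

Even with FP8 weights, full-context KV demand exceeds free HBM, which is why paged KV management and cache eviction appear below as first-class concerns rather than nice-to-haves.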
Optimize across these dimensions:
- Quantization Strategy: Compare AWQ vs. GPTQ vs. FP8 for this model class. Impact on quality per modality. Memory-savings vs. quality tradeoff analysis.
- Batching & Scheduling: Continuous batching implementation. Priority-queue design for mixed-modality requests. Variable-length image/audio input handling.
- KV Cache Management: PagedAttention configuration. Cache eviction policy for long conversations with media. Prefix caching for common system prompts.
- Tensor Parallelism: Optimal TP degree for the 4-GPU setup. Pipeline vs. tensor parallelism tradeoffs. NVLink topology-aware placement.
- Monitoring & Autoscaling: Key metrics (TTFT, tokens/s, queue depth, GPU utilization). Autoscaling triggers. Graceful degradation under load.
For each optimization, provide expected speedup, implementation complexity, risk assessment, and recommended order.
Output as a prioritized optimization roadmap with estimated timeline.