PromptForge
AI Development

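LLM Multi-Model Hybrid Inference Cluster Deployment Plan Generator

Automatically generates a deployment plan for a multi-model hybrid inference cluster from your business scenario and hardware constraints, covering model allocation, routing strategy, and cost estimation.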

7 views · 4/23/2026

You are an expert LLM infrastructure architect. Help me design a multi-model inference cluster deployment plan.

My Setup

  • Hardware: [describe GPU/CPU resources, e.g., 4x A100 80GB, 2x RTX 4090]
  • Budget: [monthly budget]
  • Use cases: [list your use cases, e.g., code gen, RAG Q&A, translation, summarization]
  • Expected QPS: [queries per second per use case]
  • Latency requirement: [e.g., <2s for chat, <10s for code gen]

Generate the following:

1. Model Selection Matrix

| Use Case | Recommended Model | Size | Quantization | Why |

2. Hardware Allocation

  • Which model runs on which GPU
  • Memory budget per model
  • Batch size recommendations
  • KV cache allocation strategy
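
As a rough illustration of the memory-budget and KV-cache arithmetic requested above, here is a minimal Python sketch. The function names, the 0.9 utilization factor, and the example architecture numbers (32 layers, 32 KV heads, head dimension 128) are assumptions chosen for illustration, not values from any particular model or engine, and real deployments also need headroom for activations and runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb_per_token(num_layers: int, num_kv_heads: int,
                          head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size for one token, in GB.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value / 1e9


def max_cached_tokens(gpu_gb: float, utilization: float,
                      weights_gb: float, per_token_gb: float) -> int:
    """Tokens of KV cache that fit after weights are loaded (0 if none fit)."""
    budget_gb = gpu_gb * utilization - weights_gb
    return max(0, int(budget_gb / per_token_gb))


# Hypothetical example: a 7B-parameter model in 16-bit on one 80 GB GPU.
weights = weight_memory_gb(7, 16)                  # 14.0 GB of weights
per_token = kv_cache_gb_per_token(32, 32, 128)     # assumed architecture
capacity = max_cached_tokens(80, 0.9, weights, per_token)
print(f"weights: {weights:.1f} GB, KV capacity: {capacity} tokens")
```

Dividing the token capacity by the expected context length gives a first estimate of how many concurrent sequences a GPU can hold, which in turn bounds the batch size.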

3. Routing Strategy

  • Semantic router rules (which query → which model)
  • Fallback chain (primary → secondary → tertiary)
  • Load balancing algorithm recommendation
  • Cost-aware routing rules
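
The fallback chain above can be sketched in a few lines of Python. This is a simplified stand-in: the keyword classifier replaces a real semantic router, and the model names, the `ROUTES` table, and the `call_model` callback are all hypothetical placeholders rather than any gateway's actual API.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical fallback chains: primary -> secondary -> tertiary.
ROUTES: Dict[str, List[str]] = {
    "code": ["code-model-large", "general-model-medium", "general-model-small"],
    "translation": ["translate-model", "general-model-medium", "general-model-small"],
    "general": ["general-model-medium", "general-model-small"],
}


def classify(query: str) -> str:
    """Keyword-based stand-in for a semantic router (assumption, not embeddings)."""
    q = query.lower()
    if any(k in q for k in ("function", "bug", "compile", "refactor")):
        return "code"
    if "translate" in q:
        return "translation"
    return "general"


def route(query: str, call_model: Callable[[str, str], str]) -> Tuple[str, str]:
    """Try each model in the chain in order; return (model, response)."""
    last_error: Optional[Exception] = None
    for model in ROUTES[classify(query)]:
        try:
            return model, call_model(model, query)
        except Exception as exc:  # timeout, overload, backend down, ...
            last_error = exc
    raise RuntimeError("all models in the fallback chain failed") from last_error
```

Cost-aware routing fits the same shape: order each chain from cheapest acceptable model to most expensive, or classify by query difficulty instead of topic.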

4. Serving Stack

  • Inference engine (vLLM / TensorRT-LLM / SGLang)
  • Gateway/proxy (LiteLLM / Bifrost / custom)
  • Monitoring (Langfuse / Prometheus metrics)
  • Auto-scaling triggers

5. Cost Analysis

| Model | GPU Hours/day | Est. Monthly Cost | Cost per 1K tokens |
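
For reference, the columns in this table relate to each other by simple arithmetic, sketched below in Python. The hourly rate, the 30-day month, and the throughput figure are illustrative assumptions, not measured or quoted prices.

```python
def monthly_cost(hourly_rate_usd: float, gpu_count: int,
                 gpu_hours_per_day: float, days: int = 30) -> float:
    """Estimated monthly GPU cost, assuming a fixed hourly rate per GPU."""
    return hourly_rate_usd * gpu_count * gpu_hours_per_day * days


def cost_per_1k_tokens(hourly_rate_usd: float, gpu_count: int,
                       tokens_per_second: float) -> float:
    """Cost of generating 1,000 tokens at a sustained aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd * gpu_count / tokens_per_hour * 1000


# Hypothetical example: one GPU at $2.00/hour, running 24 h/day,
# sustaining 1,000 generated tokens per second across all requests.
print(monthly_cost(2.0, 1, 24))                      # 1440.0
print(round(cost_per_1k_tokens(2.0, 1, 1000), 6))    # 0.000556
```

The per-token figure assumes the GPU is fully utilized; at lower utilization the effective cost per 1K tokens rises proportionally.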

6. Docker Compose / K8s Manifest

Provide a ready-to-deploy configuration file.

7. Optimization Tips

  • Speculative decoding opportunities
  • Prefix caching strategy
  • Continuous batching tuning

Be practical and specific. Prefer open-source solutions.