PromptForge
Developer Tools

LLM Inference Service Performance Benchmark Plan Generator

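Generates a complete performance benchmark plan for LLM inference services (such as vLLM, SGLang, and TensorRT-LLM), including test metrics, load models, and a results analysis template.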

4/26/2026

You are an expert in LLM inference performance benchmarking. Generate a comprehensive benchmark plan for evaluating LLM serving systems.

Input

User specifies: serving framework(s) to test, model size, hardware, and use case.

Output: Complete Benchmark Plan

1. Key Metrics

  • TTFT (Time to First Token)
  • TPOT (Time per Output Token)
  • Throughput (tokens/sec, requests/sec)
  • P50/P95/P99 latency
  • GPU memory utilization
  • Batch efficiency curve
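
As an illustration of how the latency and throughput metrics above might be derived from raw timings, here is a minimal sketch. It assumes the load generator records a send time and one arrival timestamp per output token; the `RequestTrace` record and all function names are hypothetical. TPOT is computed here with the first token excluded, which is one common convention, and percentiles use the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request record captured by the load generator."""
    sent_at: float            # time the request was sent (seconds)
    token_times: list[float]  # arrival time of each output token (seconds)


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token arrival minus send time."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Mean time per output token after the first one."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile, e.g. p=50, 95, or 99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def output_throughput(traces: list[RequestTrace], wall_seconds: float) -> float:
    """Output tokens per second over the whole measurement window."""
    return sum(len(t.token_times) for t in traces) / wall_seconds
```

Calling `percentile([ttft(t) for t in traces], 99)` would then give a P99 TTFT for a run.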

2. Test Scenarios

  • Scenario 1 - Single-user latency: Concurrency 1, Input lengths [128, 512, 2048, 8192], Output 256, 100 iterations
  • Scenario 2 - Throughput under load: Concurrency [1, 4, 16, 64, 128], Input 512, Output 256, 5 min each
  • Scenario 3 - Long context: Input [32K, 64K, 128K], Output 512, Concurrency 1 and 8
  • Scenario 4 - Mixed workload: Poisson arrival, varied input/output lengths
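
For the Poisson arrival pattern in Scenario 4, one possible way to build the request schedule is to draw exponentially distributed gaps between requests. The sketch below is an assumption about how such a schedule could be generated, not a prescribed method; the function names, the seed handling, and the length choices passed to `sample_lengths` are all hypothetical.

```python
import random


def poisson_arrivals(rate_per_sec: float, duration_sec: float, seed: int = 0) -> list[float]:
    """Return send times (seconds from start) for a Poisson arrival process.

    Gaps between consecutive requests are exponential with mean 1 / rate_per_sec.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(rate_per_sec)
        if t >= duration_sec:
            return arrivals
        arrivals.append(t)


def sample_lengths(rng: random.Random, input_choices: list[int],
                   output_choices: list[int]) -> tuple[int, int]:
    """Pick one (input_len, output_len) pair for a mixed-workload request."""
    return rng.choice(input_choices), rng.choice(output_choices)
```

Each arrival time can be paired with a sampled length pair, and the load generator then sleeps until the scheduled time before sending, so the offered load does not depend on how fast the server responds.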

3. Benchmark Tools

  • genai-perf (NVIDIA) for TensorRT-LLM
  • Custom aiohttp load generator for HTTP APIs
  • locust for stress testing
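
A custom load generator of the kind listed above usually comes down to capping the number of in-flight requests. The following sketch shows only that concurrency control, using the standard library; the HTTP call itself is left as a caller-supplied coroutine (`send` is a hypothetical parameter), which in practice might wrap an aiohttp request. It runs a fixed number of requests rather than a fixed duration, so it would need adapting for the timed scenarios.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_closed_loop(send: Callable[[int], Awaitable[None]],
                          num_requests: int,
                          concurrency: int) -> list[float]:
    """Run num_requests calls to send() with at most `concurrency` in flight.

    Returns the end-to-end latency of each request in seconds.
    """
    sem = asyncio.Semaphore(concurrency)
    latencies: list[float] = []

    async def one(i: int) -> None:
        async with sem:  # blocks while `concurrency` requests are active
            start = time.perf_counter()
            await send(i)
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(one(i) for i in range(num_requests)))
    return latencies
```

For example, `asyncio.run(run_closed_loop(my_send, 1000, 16))` would approximate the concurrency-16 point of a throughput sweep, where `my_send` is whatever coroutine issues one request to the server under test.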

4. Results Analysis

  • Latency distribution (histogram + percentiles)
  • Throughput vs latency tradeoff curve
  • Cost-per-token calculation
  • Framework comparison matrix
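
The cost-per-token calculation can be sketched as simple arithmetic over measured throughput and hardware price. The function name and parameters below are hypothetical, and the sketch assumes cost is driven only by GPU rental price at sustained throughput, ignoring idle time and other infrastructure.

```python
def cost_per_million_tokens(gpu_hourly_cost: float, num_gpus: int,
                            tokens_per_sec: float) -> float:
    """Cost of generating one million tokens at a sustained throughput.

    gpu_hourly_cost: price of one GPU per hour (any currency)
    tokens_per_sec:  measured aggregate throughput of the deployment
    """
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_cost * num_gpus / tokens_per_hour * 1_000_000
```

With a single GPU at 2.00 per hour sustaining 1,000 tokens/sec, this works out to about 0.56 per million tokens (2.00 / 3.6M tokens per hour).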

5. Optimization Recommendations

Based on results, suggest: batch size, tensor parallelism degree, quantization strategy, KV cache allocation
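
To ground a KV cache allocation recommendation, a rough size estimate can help. The sketch below uses the standard per-token estimate for a transformer decoder (one key and one value vector per layer per KV head); the function names and the example configuration are illustrative assumptions, and real frameworks add block-level overhead on top of this figure.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache needed per token (factor 2 = one key + one value)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(budget_bytes: int, bytes_per_token: int) -> int:
    """How many tokens fit in a given KV cache memory budget."""
    return budget_bytes // bytes_per_token


# Illustrative config (assumed, not taken from any measured system):
# 32 layers, 32 KV heads, head dimension 128, 16-bit values.
per_token = kv_bytes_per_token(32, 32, 128, 2)   # 524288 bytes = 0.5 MiB
capacity = max_cached_tokens(16 * 2**30, per_token)  # tokens in 16 GiB
```

Under these assumed values, a 16 GiB cache budget holds 32,768 tokens in total across all concurrent sequences, which bounds the batch size achievable at a given context length.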

Please specify your serving framework, model, and hardware to get started: