PromptForge
AI/ML · On-device deployment · Performance optimization · LLM inference acceleration

On-Device LLM Performance Tuning Checklist Generator

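Generates a complete performance tuning checklist and optimization recommendations for large language models deployed locally or on-device.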

15 views · 4/6/2026

You are a performance optimization expert specializing in on-device / edge LLM deployment. Generate a comprehensive performance tuning checklist for my setup.

My Setup

  • Hardware: [e.g., MacBook M4 Max 128GB / RTX 4090 / Jetson Orin / iPhone 16 Pro]
  • Model: [e.g., Llama 3.3 70B / Qwen3 32B / Phi-4 / Gemma 3]
  • Framework: [e.g., llama.cpp / MLX / vLLM / TensorRT-LLM / LiteRT]
  • Use case: [e.g., code completion / RAG chatbot / real-time translation]
  • Target latency: [e.g., < 200 ms time to first token, > 30 tok/s generation]

Generate

1. Quantization Audit

  • Current quantization level and recommended alternatives
  • Quality vs speed tradeoff analysis for my specific use case
  • Recommended quant methods (GGUF Q4_K_M, AWQ, GPTQ, etc.)
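As a rough aid for the quantization audit above, the sketch below estimates how much memory the weights alone would need at different bit widths. The function name and the bits-per-weight values are illustrative assumptions; real quantization formats store extra per-block scales, so their effective bits per weight differ from the nominal figure.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB.

    Ignores KV cache, activations, and runtime overhead.
    """
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    # Nominal bit widths only (assumption); effective values depend on the format.
    for label, bpw in [("16-bit", 16.0), ("8-bit", 8.0), ("4-bit", 4.0)]:
        print(f"70B params at {label}: {weight_memory_gib(70e9, bpw):.1f} GiB")
```

Comparing these figures against available RAM or VRAM gives a first filter on which quantization levels are even candidates before quality is evaluated.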

2. Memory Optimization

  • KV cache configuration (size, type, quantization)
  • Context length vs memory budget calculator
  • Batch size recommendations
  • Memory-mapped vs fully loaded tradeoffs
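The context-length versus memory-budget item above can be sketched with the usual KV cache size estimate: two tensors (keys and values) per layer, each holding one vector per KV head per token. This is a minimal sketch, not any framework's own accounting; the function names are hypothetical, the architecture numbers in the example are placeholders, and actual allocators may add padding or overhead.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: float = 2.0,
                   batch_size: int = 1) -> int:
    """Estimated KV cache size in bytes.

    The factor 2 covers keys and values. bytes_per_elem is 2.0 for a
    16-bit cache; use a smaller value to model a quantized cache.
    """
    return int(2 * n_layers * n_kv_heads * head_dim
               * context_len * bytes_per_elem * batch_size)


def max_context_for_budget(budget_bytes: int, n_layers: int, n_kv_heads: int,
                           head_dim: int, bytes_per_elem: float = 2.0,
                           batch_size: int = 1) -> int:
    """Largest context length whose KV cache fits in budget_bytes."""
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1,
                               bytes_per_elem, batch_size)
    return budget_bytes // per_token


if __name__ == "__main__":
    # Placeholder architecture values (assumption): 80 layers, 8 KV heads, head_dim 128.
    size = kv_cache_bytes(80, 8, 128, context_len=8192)
    print(f"KV cache at 8192 tokens: {size / 2**30:.2f} GiB")
    print("Max context in 4 GiB:", max_context_for_budget(4 * 2**30, 80, 8, 128))
```

Halving `bytes_per_elem` doubles the context that fits in the same budget, which is why KV cache quantization appears alongside context length in this section.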

3. Compute Optimization

  • Thread/core allocation strategy
  • GPU layer offloading recommendations
  • Flash attention / paged attention configuration
  • Speculative decoding feasibility

4. System-Level Tuning

  • OS-level settings (huge pages, CPU governor, thermal management)
  • I/O optimization for model loading
  • Concurrent request handling strategy

5. Benchmark Commands

  • Provide exact CLI commands to benchmark before/after each optimization
  • Include expected improvement ranges
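To compare before/after runs against the latency targets in the setup, it can help to reduce each run to the same two numbers: time to first token and decode throughput. The sketch below assumes you have already recorded a request start time and one timestamp per generated token (for example with `time.perf_counter()`); the function names are hypothetical and not part of any benchmark tool.

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute time to first token (ms) and decode throughput (tok/s).

    token_times holds the arrival time, in seconds, of each generated token.
    Throughput counts only tokens after the first, so prompt processing
    time does not distort the generation rate.
    """
    if not token_times:
        raise ValueError("token_times must contain at least one timestamp")
    ttft_ms = (token_times[0] - request_start) * 1000
    decode_span = token_times[-1] - token_times[0]
    n_decode = len(token_times) - 1
    tok_per_s = n_decode / decode_span if decode_span > 0 else float("nan")
    return {"ttft_ms": ttft_ms, "tok_per_s": tok_per_s}


def improvement_pct(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


if __name__ == "__main__":
    # Synthetic timestamps for illustration only, not measured results.
    run = latency_metrics(0.0, [0.15, 0.20, 0.25, 0.30, 0.35])
    print(run)
```

Running each measurement several times and comparing medians makes the before/after difference less sensitive to thermal throttling and background load.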

Output the result as a prioritized checklist, giving each item a [HIGH/MED/LOW] impact rating and an estimated effort.