Developer Tools, Local Deployment, LLM Inference, Performance Optimization, llama.cpp, vLLM

Local LLM Inference Service: One-Click Deployment and Performance Tuning Script Generator

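Automatically generates an optimal deployment script for a local LLM inference service based on the user's hardware configuration (GPU, CPU, memory), with framework selection and parameter tuning across llama.cpp, vLLM, Ollama, and similar options.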

8 views · 5/11/2026

You are an expert in local LLM deployment and inference optimization. Based on my hardware specs, generate a complete deployment script with optimal configuration.

My Hardware

  • OS: [macOS/Linux/Windows]
  • CPU: [MODEL, e.g., Apple M4 Max, AMD 7950X, Intel i9-14900K]
  • GPU: [MODEL + VRAM, e.g., RTX 4090 24GB, Apple Silicon unified 64GB, None]
  • RAM: [TOTAL, e.g., 64GB]
  • Storage: [SSD TYPE + FREE SPACE]
  • Network: [Local only / Need API server]

Requirements

  • Model(s) I want to run: [e.g., Qwen3 32B, Llama 3.3 70B, DeepSeek-V3]
  • Use case: [Chat / Code completion / RAG / Batch processing / API server]
  • Concurrent users: [1 / 5 / 10+]
  • Latency requirement: [Real-time < 50ms/tok / Interactive < 200ms/tok / Batch OK]
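
The latency tiers above are stated per token, while benchmarks usually report tokens per second. The sketch below shows the conversion and the aggregate throughput implied by a given number of concurrent users. The function names are illustrative, and it assumes every user streams at the full budgeted rate at the same time, which is a worst case rather than a typical load.

```python
# Convert a per-token latency budget into the decode throughput it implies.

def required_tokens_per_second(ms_per_token: float) -> float:
    """Minimum per-user generation speed that meets the latency budget."""
    return 1000.0 / ms_per_token


def aggregate_tokens_per_second(ms_per_token: float, concurrent_users: int) -> float:
    """Total throughput if every user streams at the budgeted rate at once
    (worst-case assumption)."""
    return required_tokens_per_second(ms_per_token) * concurrent_users


if __name__ == "__main__":
    for label, budget_ms in [("Real-time", 50), ("Interactive", 200)]:
        per_user = required_tokens_per_second(budget_ms)
        print(f"{label}: at least {per_user:.0f} tok/s per user")
```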

Generate

1. Framework Selection

Recommend the best framework (llama.cpp / vLLM / Ollama / MLX / TensorRT-LLM) with reasoning.

2. Model Quantization Recommendation

  • Best quant level for my VRAM/RAM budget
  • Expected quality tradeoff
  • Download command
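
As a rough illustration of how a quant level can be matched to a memory budget, the sketch below estimates weight size as parameters × bits per weight. The bits-per-weight figures are approximate averages assumed for illustration, not exact file sizes, and the 4 GiB headroom for KV cache and runtime buffers is likewise an assumption to adjust for your setup.

```python
# Approximate bits per weight for a few common GGUF quant types.
# These values are rough assumptions; real files vary by model architecture.
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}


def weight_gib(params_billion: float, quant: str) -> float:
    """Estimated size of the model weights in GiB for a given quant type."""
    total_bits = params_billion * 1e9 * APPROX_BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


def largest_fitting_quant(params_billion: float, budget_gib: float,
                          headroom_gib: float = 4.0):
    """Return the highest-precision quant whose weights fit in the budget
    after reserving headroom, or None if none of the listed quants fit."""
    usable = budget_gib - headroom_gib
    by_precision = sorted(APPROX_BITS_PER_WEIGHT,
                          key=APPROX_BITS_PER_WEIGHT.get, reverse=True)
    for quant in by_precision:
        if weight_gib(params_billion, quant) <= usable:
            return quant
    return None


if __name__ == "__main__":
    print(largest_fitting_quant(32, 24))   # 32B model, 24 GiB budget
    print(largest_fitting_quant(70, 24))   # 70B model, same budget
```

A result of None suggests the model needs a lower-bit quant than those listed, partial CPU offload, or more memory.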

3. Deployment Script

Generate a complete, copy-paste-ready shell script that:

  • Installs dependencies
  • Downloads the model
  • Configures optimal parameters (context length, batch size, threads, GPU layers)
  • Starts the server with health checks
  • Includes a systemd/launchd service file for auto-start
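
The "starts the server with health checks" step can be sketched as a retry loop that polls the server until it answers. This is a minimal sketch, not any framework's own tooling: the function names are hypothetical, and the URL in the usage note is an assumed address that should be checked against the chosen server's documentation.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True when the URL answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_healthy(probe, attempts: int = 30, delay_s: float = 2.0,
                       sleep=time.sleep):
    """Call probe() until it returns True or attempts run out.

    Returns the 1-based attempt number that succeeded, or None on failure.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    return None
```

For example, `wait_until_healthy(lambda: http_probe("http://127.0.0.1:8080/health"))` would wait up to about a minute for a server assumed to listen on port 8080 with a `/health` route.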

4. Performance Tuning

  • Memory mapping strategy
  • KV cache configuration
  • Speculative decoding setup (if applicable)
  • Recommended context length vs speed tradeoffs
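
To make the context length tradeoff concrete, the sketch below estimates KV cache size for a standard transformer with grouped-query attention: one K and one V tensor per layer, each of shape context × KV heads × head dimension. The model dimensions in the example are hypothetical, not those of any specific model, and the 1-byte case ignores the small block-scale overhead of quantized caches.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Estimated KV cache size in bytes for one sequence.

    The leading factor 2 accounts for the K and V tensors in each layer.
    bytes_per_elem is 2 for an FP16 cache and roughly 1 for an 8-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model: 64 layers, 8 KV heads, head dimension 128.
    for ctx in (8192, 32768):
        gib = kv_cache_bytes(64, 8, 128, ctx) / 2**30
        print(f"context {ctx}: {gib:.1f} GiB FP16 KV cache")
```

The cache grows linearly with context length and with the number of concurrent sequences, so a longer context competes directly with the weights for the same memory.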

5. Benchmarking Commands

Provide commands to measure:

  • Tokens/second (prompt processing + generation)
  • Time to first token
  • Memory usage under load
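
The first two metrics above can be derived from any streamed response by timing token arrivals. The sketch below is framework-agnostic: `measure_stream` is a hypothetical helper that accepts any iterable of tokens, and the prompt-processing rate it reports is an approximation, since time to first token also includes queueing and network overhead.

```python
import time


def measure_stream(tokens, prompt_tokens=None, clock=time.perf_counter):
    """Time a token stream and report latency and throughput.

    tokens: any iterable that yields generated tokens as they arrive.
    prompt_tokens: optional prompt length, used to approximate prefill speed.
    """
    start = clock()
    first_at = None
    last_at = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_at is None:
            first_at = now
        last_at = now
        count += 1

    ttft = None if first_at is None else first_at - start
    decode_rate = None
    if count > 1 and last_at > first_at:
        # Tokens after the first, divided by the time spent producing them.
        decode_rate = (count - 1) / (last_at - first_at)
    prefill_rate = None
    if prompt_tokens is not None and ttft:
        prefill_rate = prompt_tokens / ttft

    return {
        "tokens": count,
        "ttft_s": ttft,
        "decode_tok_per_s": decode_rate,
        "prefill_tok_per_s": prefill_rate,
    }
```

Passing the iterator of a streaming client response as `tokens` gives per-request figures; averaging several runs after a warm-up request tends to give more stable numbers.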

Output everything as executable code blocks with comments explaining each parameter choice.