One-Click Deployment and Performance Tuning Script Generator for Local LLM Inference Services

You are an expert in local LLM deployment and inference optimization. Based on my hardware specs, generate a complete deployment script with optimal configuration.

My Hardware

OS: [macOS/Linux/Windows]
CPU: [MODEL, e.g., Apple M4 Max, AMD 7950X, Intel i9-14900K]
GPU: [MODEL + VRAM, e.g., RTX 4090 24GB, Apple Silicon unified 64GB, None]
RAM: [TOTAL, e.g., 64GB]
Storage: [SSD TYPE + FREE SPACE]
Network: [Local only / Need API server]

Requirements

Model(s) I want to run: [e.g., Qwen3 32B, Llama 3.3 70B, DeepSeek-V3]
Use case: [Chat / Code completion / RAG / Batch processing / API server]
Concurrent users: [1 / 5 / 10+]
Latency requirement: [Real-time (< 50 ms/token) / Interactive (< 200 ms/token) / Batch OK]

Generate

1. Framework Selection

Recommend the best framework (llama.cpp / vLLM / Ollama / MLX / TensorRT-LLM) with reasoning.

2. Model Quantization Recommendation

Best quant level for my VRAM/RAM budget
Expected quality tradeoff
Download command
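
As a rough illustration of how the quant level follows from the memory budget, the sketch below estimates weight size from parameter count and bits per weight, then picks the highest-precision level that fits. The bits-per-weight figures are approximate values for common llama.cpp GGUF quant types. The names `weight_gib` and `pick_quant` and the 4 GiB reserve for KV cache and runtime overhead are assumptions made for this example, not part of any library.

```python
# Approximate bits per weight for some GGUF quant types.
# These are rough, assumed figures for illustration; real file sizes vary by model.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
}


def weight_gib(params_billion, bits_per_weight):
    """Estimated size of the model weights in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def pick_quant(params_billion, budget_gib, reserve_gib=4.0):
    """Return the highest-precision quant whose weights fit in the budget.

    reserve_gib is a hypothetical allowance for KV cache and runtime overhead.
    Returns None when even the smallest listed quant does not fit.
    """
    usable = budget_gib - reserve_gib
    # Try the largest (highest quality) quant first.
    for name, bpw in sorted(BITS_PER_WEIGHT.items(), key=lambda kv: -kv[1]):
        if weight_gib(params_billion, bpw) <= usable:
            return name
    return None


if __name__ == "__main__":
    print(pick_quant(32, 24))  # 32B model on a 24 GiB budget
    print(pick_quant(70, 24))  # 70B model on the same budget
```

With these assumed figures, a 32B model on a 24 GiB budget lands on Q4_K_M, while a 70B model does not fit at any listed level and would need partial CPU offload or a smaller quant.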

3. Deployment Script

Generate a complete, copy-paste-ready shell script that:

Installs dependencies
Downloads the model
Configures optimal parameters (context length, batch size, threads, GPU layers)
Starts the server with health checks
Includes a service definition for auto-start (systemd on Linux, launchd on macOS, or the platform equivalent on Windows)
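
To make the "configure, start, health-check" steps above concrete, here is a minimal sketch that assumes the chosen framework is llama.cpp and its `llama-server` binary is on the PATH. The flags used (`-m`, `-c`, `-ngl`, `-t`, `-b`, `--host`, `--port`) and the `/health` endpoint belong to llama.cpp's server. The helper names `build_server_cmd` and `wait_for_health` and every default value are placeholders invented for this example, not tuned recommendations.

```python
import time
import urllib.request


def build_server_cmd(model_path, ctx_len=8192, gpu_layers=99, threads=8,
                     batch_size=512, host="127.0.0.1", port=8080):
    """Assemble a llama-server command line; all defaults are placeholders."""
    return [
        "llama-server",
        "-m", model_path,          # path to the GGUF model file
        "-c", str(ctx_len),        # context length in tokens
        "-ngl", str(gpu_layers),   # number of layers to offload to the GPU
        "-t", str(threads),        # CPU threads
        "-b", str(batch_size),     # batch size for prompt processing
        "--host", host,
        "--port", str(port),
    ]


def wait_for_health(url, timeout_s=120.0, interval_s=2.0):
    """Poll a health URL until it answers HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or a non-2xx reply while loading.
            pass
        time.sleep(interval_s)
    return False
```

A caller might launch the server with `subprocess.Popen(build_server_cmd("model.gguf"))` and then block on `wait_for_health("http://127.0.0.1:8080/health")` before reporting the deployment as ready.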

4. Performance Tuning

Memory mapping strategy
KV cache configuration
Speculative decoding setup (if applicable)
Recommended context length vs speed tradeoffs
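
The context length versus memory tradeoff above can be estimated with the standard KV cache size calculation: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below is a back-of-the-envelope estimate under that assumption. The function name `kv_cache_gib` is made up for this example, and the layer and head counts in the usage lines are taken to be those of a Llama-3-style 70B model, so they should be checked against the actual model's config.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len,
                 bytes_per_elem=2, n_seqs=1):
    """Estimated KV cache size in GiB.

    bytes_per_elem is 2 for an fp16 cache; a quantized cache would use less.
    n_seqs is the number of concurrent sequences each holding a full context.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * ctx_len * bytes_per_elem * n_seqs)
    return total_bytes / 2**30


if __name__ == "__main__":
    # Assumed shape: 80 layers, 8 KV heads, head dimension 128.
    print(kv_cache_gib(80, 8, 128, 8192))   # 2.5
    print(kv_cache_gib(80, 8, 128, 32768))  # 10.0
```

The cache grows linearly with both context length and concurrent sequences, so quadrupling the context or serving four users at full context each costs roughly the same extra memory.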

5. Benchmarking Commands

Provide commands to measure:

Tokens/second (reported separately for prompt processing and generation)
Time to first token
Memory usage under load
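
As one possible way to compute the first two metrics, the sketch below times any iterable that yields tokens as they arrive, such as a streaming API response. It is framework-agnostic and does not call a real server. The name `measure_stream` and the injectable `clock` parameter are assumptions introduced for this example.

```python
import time


def measure_stream(tokens, clock=time.perf_counter):
    """Measure time to first token and generation speed over a token stream.

    tokens: any iterable that yields one item per generated token.
    clock:  a zero-argument function returning seconds (injectable for tests).
    """
    start = clock()
    first = None
    last = None
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        last = now
        count += 1

    if first is None:
        return {"ttft_s": None, "gen_tok_per_s": None, "tokens": 0}

    # Generation speed excludes the wait for the first token,
    # which is dominated by prompt processing.
    gen_time = last - first
    rate = (count - 1) / gen_time if count > 1 and gen_time > 0 else None
    return {"ttft_s": first - start, "gen_tok_per_s": rate, "tokens": count}
```

Separating time to first token from the steady-state rate keeps prompt-processing cost from distorting the generation figure, which matters most for long prompts.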

Output everything as executable code blocks with comments explaining each parameter choice.