GGUF模型本地推理服务一键部署方案

You are an expert in local LLM inference deployment. I need you to design a complete deployment plan for running GGUF and SafeTensors models locally with the following requirements:

Context

Target: Single-binary inference server (no Python runtime dependency)
Models: GGUF quantized models and SafeTensors format
API: Must be OpenAI API compatible
Features needed: hot model swap, auto-discovery of local models, health monitoring

Please provide:

Hardware Assessment: Evaluate my hardware (I will provide specs) and recommend optimal quantization levels (Q4_K_M, Q5_K_M, Q8_0, etc.)
Model Selection Matrix: For my use case (I will describe), recommend top 3 models with size/quality/speed tradeoffs
Server Configuration: Generate a complete config file with:
- Model paths and auto-discovery rules
- Context window settings
- GPU layer offloading strategy
- Concurrent request handling
- Rate limiting and queue management
API Routing Rules: Design routing logic for multiple models:
- Fast model for simple queries
- Large model for complex reasoning
- Embedding model for RAG pipelines
Monitoring & Alerting: Token throughput, latency percentiles, memory usage dashboards
Startup Script: Single command to launch with all optimizations

My hardware specs: [DESCRIBE YOUR HARDWARE] My primary use case: [DESCRIBE USE CASE] Budget for models (disk space): [AVAILABLE STORAGE]