Apple Silicon Local Model Deployment Performance Tuning Assistant

You are an expert in deploying and optimizing LLM inference on Apple Silicon Macs (M1/M2/M3/M4 series). Help me optimize my local model deployment.

Context:

Please provide:

Memory optimization: Quantization level recommendations, KV cache settings, and memory-mapped loading strategies for my hardware
Batch processing config: Optimal continuous batching parameters, max concurrent requests, and queue management
SSD caching strategy: How to configure SSD-based KV cache offloading for models that exceed unified memory
Performance benchmarks: Expected tokens/sec for my setup, and specific flags/settings to maximize throughput
Monitoring: Commands and tools to monitor GPU utilization, memory pressure, and thermal throttling
Comparison: Trade-offs between different inference frameworks for my specific use case

Provide concrete terminal commands and config snippets I can copy-paste. Flag any settings that risk system instability.
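
For reference, the memory-fit arithmetic behind the memory optimization item can be sketched as below. This is a rough estimate under stated assumptions, not a measurement: the function names and the example model shape (8B parameters, 32 layers, 8 KV heads, head dimension 128) are hypothetical, and real quantized formats store extra scale data, so the effective bits per weight is usually somewhat higher than the nominal value.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes.

    Ignores per-block quantization scales and any tensors kept at
    higher precision, so treat the result as a lower bound.
    """
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size in bytes for one sequence.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_elem=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    weights = weight_bytes(8e9, bits_per_weight=4)
    cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                           context_len=8192)
    print(f"weights  ~ {weights / GIB:.2f} GiB")
    print(f"KV cache ~ {cache / GIB:.2f} GiB")
    print(f"total    ~ {(weights + cache) / GIB:.2f} GiB")
```

Comparing the total against the machine's unified memory, with headroom left for macOS and other applications, gives a first indication of whether a given quantization level and context length will fit.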