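On-Device LLM Deployment Performance Benchmark Report Generator
Generates a performance benchmark report for LLM inference on on-device and edge hardware, covering latency, throughput, memory footprint, and a comparison of quantization strategies.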
You are an edge AI deployment specialist. Generate a comprehensive benchmark report for deploying LLMs on edge devices.
Test Configuration
- Target device: [Raspberry Pi 5 / iPhone 16 / Android flagship / Mac Mini M4]
- Models to evaluate: [list models, e.g. Gemma-4-E2B, Phi-4-mini, Qwen3-1.7B]
- Use cases: [chat / code completion / tool calling / vision]
Report Structure
- Quantization Impact Analysis: Compare FP16, INT8, INT4, and MXFP4 for each model across model size, RAM usage, tokens/sec, and quality metrics.
- Inference Engine Comparison: Compare llama.cpp, LiteRT-LM, mistral.rs, MLX, and ONNX Runtime on cold start time, first token latency, sustained throughput, peak memory, and GPU/NPU utilization.
- Battery and Thermal Analysis (mobile): Power consumption per 1K tokens, thermal throttling onset, and sustained vs. burst performance.
- Recommendations: Best model-engine-quantization combo per use case, memory-constrained strategies, and when to use on-device vs. cloud fallback.
Format as a professional benchmark report with markdown tables.
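
The derived metrics named in the report structure (model size per quantization format, sustained tokens/sec, energy per 1K tokens) could be computed from raw measurements along the following lines. This is a minimal sketch, not a reference harness: the `Run` record, its field names, and the helper functions are hypothetical, and the INT8/INT4 bit widths are nominal values that ignore per-group scales and zero-points. The MXFP4 figure assumes 4-bit elements with one shared 8-bit scale per 32-element block.

```python
from dataclasses import dataclass

# Nominal bits per weight. MXFP4: 4-bit elements plus one shared 8-bit scale
# per 32-element block (4 + 8/32 = 4.25). INT8/INT4 values ignore per-group
# scales and zero-points, so real model files are somewhat larger.
BITS_PER_WEIGHT = {"FP16": 16.0, "INT8": 8.0, "INT4": 4.0, "MXFP4": 4.25}


def weight_size_gib(n_params: float, fmt: str) -> float:
    """Lower-bound size of the weights alone (no KV cache, no activations)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


@dataclass
class Run:
    """One measured generation; all fields are hypothetical placeholders."""
    first_token_s: float  # time from request to first generated token
    total_s: float        # time from request to last generated token
    tokens_out: int       # number of generated tokens
    avg_power_w: float    # mean device power draw during the run


def decode_tokens_per_sec(run: Run) -> float:
    """Sustained decode throughput, excluding the first token's latency."""
    return (run.tokens_out - 1) / (run.total_s - run.first_token_s)


def joules_per_1k_tokens(run: Run) -> float:
    """Energy per 1K generated tokens: power x time, scaled to 1000 tokens."""
    return run.avg_power_w * run.total_s / run.tokens_out * 1000


def markdown_table(headers, rows) -> str:
    """Render rows as a markdown table, the output format the report asks for."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    n_params = 2e9  # illustrative 2B-parameter model, not a measured value
    rows = [(fmt, f"{weight_size_gib(n_params, fmt):.2f}")
            for fmt in BITS_PER_WEIGHT]
    print(markdown_table(["Format", "Weights (GiB)"], rows))
```

Separating first token latency from decode throughput keeps prompt-processing cost from distorting the sustained tokens/sec figure. The weight-size estimate is only a floor on RAM usage, so measured peak memory should still be reported alongside it.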