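AI Engineering
Multi-Model Benchmark Evaluation Script Generator
Automatically generates comparison evaluation scripts for multiple LLMs from your evaluation requirements, with support for custom test sets, scoring metrics, and result visualization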
5/2/2026
You are an LLM benchmarking engineer. Generate a complete, runnable evaluation script based on my requirements.
My Evaluation Needs:
- Models to compare: {MODEL_LIST} (e.g., GPT-4o, Claude Sonnet, Gemini 2.5 Pro, Qwen3)
- Task type: {TASK_TYPE} (e.g., code generation, reasoning, summarization, translation, multi-turn dialogue)
- Test dataset: {DATASET_DESCRIPTION} (e.g., 50 coding problems from LeetCode medium, 100 news articles for summarization)
- Metrics: {METRICS} (e.g., accuracy, latency, token cost, BLEU score, human preference)
- Budget constraint: {BUDGET} (e.g., $50 total, or unlimited)
Generate:
- Python evaluation script using the litellm or openai SDK for unified API calls
- Test case loader (support JSON/JSONL input format)
- Scoring functions for each metric with clear rubrics
- Rate limiting & retry logic to handle API throttling
- Results aggregation with:
  - Per-model scores (mean, median, p95)
  - Statistical significance tests (paired t-test or bootstrap)
  - Cost-per-quality analysis
- Visualization code (matplotlib/plotly) generating:
  - Radar chart comparing models across dimensions
  - Box plots for score distribution
  - Latency vs quality scatter plot
- README explaining how to run, configure API keys, and interpret results
Output the complete project structure with all files. Use modern Python (3.11+), type hints, and async where beneficial.
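
For reference when reviewing what the prompt returns, the results-aggregation step (per-model mean, median, and p95, plus a paired bootstrap significance test) might look roughly like the sketch below. It is not part of the prompt itself. The function names `summarize` and `paired_bootstrap_pvalue` are hypothetical, the p95 uses the nearest-rank definition (one of several common percentile conventions), and the test assumes both models were scored on the same test cases in the same order.

```python
import random
import statistics


def summarize(scores: list[float]) -> dict[str, float]:
    """Return mean, median, and nearest-rank 95th percentile of per-item scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    # Nearest-rank p95: rank = ceil(0.95 * n), computed in integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


def paired_bootstrap_pvalue(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided paired bootstrap p-value for the mean score difference (A minus B).

    Assumption: scores_a[i] and scores_b[i] were measured on the same test case.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = statistics.fmean(diffs)
    # Centre the differences so that resampling simulates a true mean difference of zero.
    centred = [d - observed for d in diffs]
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        resampled_mean = statistics.fmean(rng.choices(centred, k=len(centred)))
        if abs(resampled_mean) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)
```

A generated script would typically call `summarize` once per model and `paired_bootstrap_pvalue` once per model pair. With many pairs, a multiple-comparison correction is worth considering before reading small p-values as real differences.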