Multi-Model Benchmark Evaluation Script Generator

You are an LLM benchmarking engineer. Generate a complete, runnable evaluation script based on my requirements.

My Evaluation Needs:

Models to compare: {MODEL_LIST} (e.g., GPT-4o, Claude Sonnet, Gemini 2.5 Pro, Qwen3)
Task type: {TASK_TYPE} (e.g., code generation, reasoning, summarization, translation, multi-turn dialogue)
Test dataset: {DATASET_DESCRIPTION} (e.g., 50 coding problems from LeetCode medium, 100 news articles for summarization)
Metrics: {METRICS} (e.g., accuracy, latency, token cost, BLEU score, human preference)
Budget constraint: {BUDGET} (e.g., $50 total, or unlimited)

Deliverables:

Python evaluation script using the litellm or openai SDK for unified API calls
Test case loader (support JSON/JSONL input format)
Scoring functions for each metric with clear rubrics
Rate limiting & retry logic to handle API throttling
Results aggregation with:
- Per-model scores (mean, median, p95)
- Statistical significance tests (paired t-test or bootstrap)
- Cost-per-quality analysis
Visualization code (matplotlib/plotly) generating:
- Radar chart comparing models across dimensions
- Box plots for score distribution
- Latency vs quality scatter plot
README explaining how to run, configure API keys, and interpret results
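
As a reference for the results aggregation item above, here is a minimal sketch of per-model summary statistics and a paired bootstrap comparison, using only the standard library. The function names `summarize` and `paired_bootstrap` are hypothetical, the p95 uses the nearest-rank definition, and the bootstrap returns a percentile confidence interval for the mean score difference; the generated script may reasonably choose other definitions.

```python
import math
import random
import statistics


def summarize(scores: list[float]) -> dict[str, float]:
    """Mean, median, and nearest-rank p95 of one model's per-item scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    # Nearest-rank percentile: smallest value with >= 95% of items at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


def paired_bootstrap(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (observed mean difference a - b, 95% CI low, 95% CI high).

    Assumes scores_a[i] and scores_b[i] were measured on the same test item.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the per-item differences with replacement and record each mean.
    resampled_means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    low = resampled_means[int(0.025 * n_resamples)]
    high = resampled_means[int(0.975 * n_resamples) - 1]
    return statistics.fmean(diffs), low, high
```

If the confidence interval excludes zero, the difference between the two models is unlikely to be explained by sampling noise alone at roughly the 5% level.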

Output the complete project structure with all files. Use modern Python (3.11+), type hints, and async where beneficial.
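
For the rate limiting and retry requirement, the following is a minimal async sketch of the kind of helper expected, assuming exponential backoff with full jitter. The name `with_retry` and its parameters are hypothetical, and `call` stands in for whatever coroutine wraps the actual model request; a real script would narrow `retry_on` to the SDK's rate-limit and timeout exceptions.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def with_retry(
    call: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: tuple[type[Exception], ...] = (Exception,),
) -> T:
    """Await call(), retrying on the given exceptions with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")


async def bounded(semaphore: asyncio.Semaphore, call: Callable[[], Awaitable[T]]) -> T:
    """Limit how many requests run concurrently, retrying each one."""
    async with semaphore:
        return await with_retry(call)
```

Sharing one `asyncio.Semaphore` per provider across all tasks is a simple way to cap concurrent requests without tracking per-minute quotas explicitly.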