PromptForge
AI Engineering

Multi-Model Benchmark Evaluation Script Generator

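Automatically generates comparative evaluation scripts for multiple LLMs based on your evaluation requirements, with support for custom test sets, scoring metrics, and result visualization.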

7 views · 5/2/2026

You are an LLM benchmarking engineer. Generate a complete, runnable evaluation script based on my requirements.

My Evaluation Needs:

  • Models to compare: {MODEL_LIST} (e.g., GPT-4o, Claude Sonnet, Gemini 2.5 Pro, Qwen3)
  • Task type: {TASK_TYPE} (e.g., code generation, reasoning, summarization, translation, multi-turn dialogue)
  • Test dataset: {DATASET_DESCRIPTION} (e.g., 50 coding problems from LeetCode medium, 100 news articles for summarization)
  • Metrics: {METRICS} (e.g., accuracy, latency, token cost, BLEU score, human preference)
  • Budget constraint: {BUDGET} (e.g., $50 total, or unlimited)

Generate:

  1. Python evaluation script using the litellm or openai SDK for unified API calls
  2. Test case loader (support JSON/JSONL input format)
  3. Scoring functions for each metric with clear rubrics
  4. Rate limiting & retry logic to handle API throttling
  5. Results aggregation with:
    • Per-model scores (mean, median, p95)
    • Statistical significance tests (paired t-test or bootstrap)
    • Cost-per-quality analysis
  6. Visualization code (matplotlib/plotly) generating:
    • Radar chart comparing models across dimensions
    • Box plots for score distribution
    • Latency vs quality scatter plot
  7. README explaining how to run, configure API keys, and interpret results
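For reference, items 2 and 5 above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not a required design: the function names (`load_cases`, `summarize`, `paired_bootstrap`), the nearest-rank definition of p95, and the choice of a 95% confidence interval on the mean paired difference are all assumptions made for the example.

```python
import json
import random
import statistics
from pathlib import Path


def load_cases(path: str) -> list[dict]:
    """Load test cases from .jsonl (one object per line) or .json (a list of objects)."""
    text = Path(path).read_text(encoding="utf-8")
    if path.endswith(".jsonl"):
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)


def summarize(scores: list[float]) -> dict[str, float]:
    """Return mean, median, and p95 of one model's scores (assumes a non-empty list)."""
    ordered = sorted(scores)
    # Nearest-rank p95: ceil(0.95 * n), computed in integers to avoid float rounding.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


def paired_bootstrap(
    a: list[float], b: list[float], n_resamples: int = 10_000, seed: int = 0
) -> tuple[float, float]:
    """95% bootstrap confidence interval for the mean of the paired differences a[i] - b[i]."""
    if len(a) != len(b):
        raise ValueError("paired scores must come from the same test cases")
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    diffs = [x - y for x, y in zip(a, b)]
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples) - 1]
```

If the interval returned by `paired_bootstrap` excludes zero, the difference between the two models is unlikely to be resampling noise on this test set. A paired t-test would be a reasonable alternative when the score differences are roughly normal.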

Output the complete project structure with all files. Use modern Python (3.11+), type hints, and async where beneficial.
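As an illustration of the retry logic and async style requested above, the sketch below wraps an arbitrary async call with exponential backoff and caps the number of requests in flight. It is a hedged example, not a prescribed implementation: `call_with_retry` and `run_all` are hypothetical names, `call` stands in for whatever litellm or openai request the generated script makes, and catching every `Exception` is a simplification (a real script would likely retry only on the SDK's rate-limit or timeout errors).

```python
import asyncio
import random
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def call_with_retry(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> T:
    """Await call(), retrying on failure with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception:  # assumption: narrow this to throttling errors in practice
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt; jitter keeps workers from retrying in lockstep.
            await asyncio.sleep(base_delay * 2**attempt * random.uniform(0.5, 1.5))
    raise RuntimeError("max_attempts must be at least 1")


async def run_all(
    calls: list[Callable[[], Awaitable[T]]], max_concurrency: int = 4
) -> list[T]:
    """Run every call with at most max_concurrency in flight, preserving input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(call: Callable[[], Awaitable[T]]) -> T:
        async with semaphore:
            return await call_with_retry(call)

    return await asyncio.gather(*(guarded(c) for c in calls))
```

The semaphore acts as a simple client-side rate limit, while the backoff handles throttling responses that still get through.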