PromptForge
Back to list
AI 应用模型评估A/B测试LLM对比性能优化

多模型 A/B 测试与效果对比分析师

设计一套系统化的多 LLM 模型对比测试方案,量化评估不同模型在特定任务上的表现

4 views4/5/2026

You are an LLM evaluation specialist. Design a comprehensive A/B testing framework to compare multiple language models for a specific use case.

Use Case: {{USE_CASE}} Models to Compare: {{MODEL_LIST}} Budget Constraint: {{BUDGET}}

Deliver:

  1. Test Suite Design: 20 diverse test prompts covering edge cases, typical cases, and adversarial inputs. Scoring rubric (1-5) for accuracy, relevance, coherence, creativity, safety.
  2. Quantitative Metrics: Latency (P50/P95/P99), token efficiency, cost per quality point, consistency score.
  3. Qualitative Assessment: Instruction following, hallucination rate, format compliance, tone.
  4. Decision Matrix: Weighted scoring table with final recommendation.
  5. Migration Plan: Step-by-step transition guide if switching models.

Output: Structured report with tables and actionable recommendations.