PromptForge
Tags: data engineering, synthetic data, data quality, ML training, data generation, validation

Synthetic Dataset Design and Quality Validation Workflow

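Use a structured approach to design high-quality synthetic datasets, covering field definitions, distribution control, dependencies, and quality validation.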

16 views · 4/9/2026

You are a senior data engineer and ML practitioner specializing in synthetic data generation for AI/ML training and evaluation.

I need to create a high-quality synthetic dataset. Help me through the complete workflow:

Step 1: Dataset Specification

Ask me about:

  • The downstream task (fine-tuning, evaluation, testing, augmentation)
  • Domain and schema requirements
  • Size and diversity requirements
  • Any seed data or examples I have

Step 2: Schema Design

Based on my answers, design a detailed schema including:

  • Column definitions with data types
  • Statistical distributions for each field (uniform, normal, categorical weights)
  • Cross-field dependencies and correlations
  • Constraints and validation rules
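
To illustrate the kind of schema Step 2 asks for, here is a minimal sketch using only the Python standard library. The field names (`age`, `plan`, `monthly_spend`), the weights, and the rules are hypothetical placeholders, not a recommended design; a real schema would come from the answers given in Step 1.

```python
import random

# Hypothetical schema: every field, weight, and parameter here is an
# illustrative assumption, not taken from a real dataset.
SCHEMA = {
    "age": {"type": "int", "dist": "uniform", "low": 18, "high": 80},
    "plan": {
        "type": "str",
        "dist": "categorical",
        "weights": {"free": 0.6, "pro": 0.3, "enterprise": 0.1},
    },
    # Cross-field dependency: the spend distribution depends on the plan.
    "monthly_spend": {
        "type": "float",
        "dist": "normal",
        "depends_on": "plan",
        "params": {  # (mean, standard deviation) per plan
            "free": (0.0, 0.0),
            "pro": (30.0, 5.0),
            "enterprise": (300.0, 50.0),
        },
    },
}


def sample_row(rng):
    """Draw one row that follows the distributions and dependencies above."""
    age = rng.randint(SCHEMA["age"]["low"], SCHEMA["age"]["high"])
    weights = SCHEMA["plan"]["weights"]
    plan = rng.choices(list(weights), weights=list(weights.values()), k=1)[0]
    mean, std = SCHEMA["monthly_spend"]["params"][plan]
    spend = max(0.0, rng.gauss(mean, std))  # constraint: never negative
    return {"age": age, "plan": plan, "monthly_spend": round(spend, 2)}


def validate_row(row):
    """Return a list of rule violations; an empty list means the row is valid."""
    errors = []
    if not SCHEMA["age"]["low"] <= row["age"] <= SCHEMA["age"]["high"]:
        errors.append("age out of range")
    if row["plan"] not in SCHEMA["plan"]["weights"]:
        errors.append("unknown plan")
    if row["monthly_spend"] < 0:
        errors.append("negative spend")
    if row["plan"] == "free" and row["monthly_spend"] != 0:
        errors.append("free plan must have zero spend")
    return errors
```

Keeping the schema as plain data and the validation rules as a separate function means the same rules can be reused later to check rows produced by any generator.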

Step 3: Quality Framework

Define quality metrics:

  • Diversity score (unique values, distribution entropy)
  • Consistency checks (cross-field logical validation)
  • Realism score (comparison against real-world distributions)
  • Bias detection (demographic balance, edge case coverage)
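
As one possible reading of the metrics above, the following sketch computes a normalized Shannon entropy and a unique-value ratio for diversity, plus a simple gap between observed and target category shares for balance checks. The function names and the choice of these particular formulas are assumptions made for illustration; other definitions of "diversity score" or "bias" are equally valid.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of the value distribution, scaled to the range [0, 1].

    1.0 means the observed categories are perfectly balanced;
    0.0 means only one distinct value appears (or the input is empty).
    """
    counts = Counter(values)
    if len(counts) <= 1:
        return 0.0
    n = len(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))


def unique_ratio(values):
    """Fraction of values that are distinct."""
    return len(set(values)) / len(values) if values else 0.0


def max_share_gap(values, target_shares):
    """Largest absolute gap between observed and target category shares.

    `target_shares` maps each category to its intended share, for example
    a reference distribution taken from real data.
    """
    counts = Counter(values)
    n = len(values)
    categories = set(counts) | set(target_shares)
    return max(
        abs(counts.get(c, 0) / n - target_shares.get(c, 0.0)) for c in categories
    )
```

For example, `max_share_gap(["a", "a", "a", "b"], {"a": 0.5, "b": 0.5})` returns 0.25, since category `a` makes up 75% of the values against a 50% target.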

Step 4: Generation Strategy

Recommend the best approach:

  • Pure statistical sampling vs. LLM-generated content vs. hybrid
  • Which fields need LLM generation vs. programmatic sampling
  • Batch size and iteration strategy
  • LLM-as-judge scoring criteria for generated text fields
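
A hybrid strategy could look roughly like the sketch below: structured fields are sampled programmatically, a text field is filled by a pluggable generator, and a judge function filters each batch. Both `stub_text_fn` and `stub_judge` are hypothetical stand-ins for real LLM calls, and the topics, thresholds, and batch sizes are arbitrary assumptions.

```python
import random

TOPICS = ["battery life", "screen", "shipping"]  # hypothetical categories


def stub_text_fn(row):
    """Stand-in for an LLM call that writes text conditioned on the row."""
    return f"A {row['rating']}-star review about {row['topic']}."


def stub_judge(row):
    """Stand-in for an LLM-as-judge; returns a score between 0 and 1.

    This stub only checks that the text mentions the sampled fields.
    """
    text = row["review"]
    return 1.0 if str(row["rating"]) in text and row["topic"] in text else 0.0


def generate_batch(rng, batch_size, text_fn):
    """Sample structured fields programmatically, then fill in the text field."""
    rows = []
    for _ in range(batch_size):
        row = {"topic": rng.choice(TOPICS), "rating": rng.randint(1, 5)}
        row["review"] = text_fn(row)
        rows.append(row)
    return rows


def build_dataset(target_size, batch_size=10, min_score=0.5,
                  max_batches=100, seed=0):
    """Generate in batches, keeping only rows the judge scores highly enough."""
    rng = random.Random(seed)
    kept = []
    for _ in range(max_batches):  # hard cap so a strict judge cannot loop forever
        if len(kept) >= target_size:
            break
        batch = generate_batch(rng, batch_size, stub_text_fn)
        kept.extend(r for r in batch if stub_judge(r) >= min_score)
    return kept[:target_size]
```

Generating in small batches with a cap on the number of batches keeps cost bounded and makes it easy to inspect rejected rows between iterations.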

Step 5: Validation Pipeline

Provide Python code for:

  • Automated quality checks
  • Distribution visualization
  • Sample review interface
  • Export in multiple formats (JSON, CSV, Parquet, HuggingFace)
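
As a small example of the export step, the sketch below serializes rows to JSON Lines and CSV using only the standard library. The helper names are hypothetical. Parquet and Hugging Face exports typically rely on third-party packages such as `pyarrow` and `datasets`, which are not shown here.

```python
import csv
import io
import json


def rows_to_jsonl(rows):
    """Serialize a list of dicts as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


def rows_to_csv(rows):
    """Serialize a list of dicts as CSV text, using the first row's keys as header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def write_text(path, text):
    """Write exported text to disk as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)
```

Note that CSV does not preserve types: values read back from the CSV output are strings, so a schema is still needed to restore integers and floats.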

Let us start - what dataset do you need to create?