PromptForge
Data Analysis

High-Performance Multimodal Data Processing Pipeline Designer

Design AI data pipelines that process images, audio, video, and structured data, suited to large-scale data engineering scenarios

15 views · 4/7/2026

You are a data engineering expert specializing in multimodal AI workloads. Help me design a high-performance data processing pipeline.

My use case: [describe your data types and scale, e.g., "processing 10M images + metadata daily for an e-commerce product search system"]

Design a pipeline that handles:

  1. Ingestion: Multi-source data collection (S3, APIs, streaming) with schema validation
  2. Processing: Parallel transformation of different modalities:
    • Images: resize, embedding generation (CLIP/SigLIP), deduplication
    • Text: chunking, embedding, entity extraction
    • Audio/Video: transcription, keyframe extraction, scene detection
    • Structured data: normalization, feature engineering
  3. Storage: Optimal storage strategy (vector DB for embeddings, object store for raw data, columnar format for metadata)
  4. Orchestration: DAG-based workflow with retry, checkpointing, and incremental processing
  5. Monitoring: Data quality checks, pipeline health, drift detection

Please provide:

  • Architecture diagram (text/mermaid format)
  • Technology recommendations with justification (e.g., Daft vs Spark vs Ray for distributed processing)
  • Sample pipeline code for the most complex modality
  • Cost estimation framework
  • Scaling strategy from prototype to production

Optimize for: throughput, cost-efficiency, and developer experience. Assume a small team (2-3 engineers).
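
For readers who want a feel for what the orchestration requirement (item 4: DAG-based workflow with retry, checkpointing, and incremental processing) involves, here is a minimal, standard-library-only Python sketch. It is not part of the prompt and is not how Daft, Spark, Ray, or any production orchestrator works. The names `run_dag`, `flaky_embed`, the three stage names, and the JSON checkpoint file are all hypothetical, chosen only for illustration. It assumes Python 3.9+ for `graphlib`.

```python
import json
import tempfile
from graphlib import TopologicalSorter
from pathlib import Path


def run_dag(tasks, deps, checkpoint_path, max_retries=2):
    """Run tasks in dependency order, skipping those already checkpointed.

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> set of task names it depends on
    Returns the list of task names executed in this call.
    """
    path = Path(checkpoint_path)
    # Load the names of tasks completed by earlier runs, if any.
    done = set(json.loads(path.read_text())) if path.exists() else set()
    executed = []
    for name in TopologicalSorter(deps).static_order():
        if name in done:
            continue  # incremental: finished in an earlier run
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted; earlier progress stays on disk
        done.add(name)
        executed.append(name)
        # Checkpoint after every task so a crash loses at most one step.
        path.write_text(json.dumps(sorted(done)))
    return executed


# Hypothetical three-stage pipeline; "embed" fails once, then succeeds.
calls = {"embed": 0}


def flaky_embed():
    calls["embed"] += 1
    if calls["embed"] == 1:
        raise RuntimeError("transient failure")


tasks = {"ingest": lambda: None, "embed": flaky_embed, "index": lambda: None}
deps = {"ingest": set(), "embed": {"ingest"}, "index": {"embed"}}

checkpoint = Path(tempfile.mkdtemp()) / "checkpoint.json"
first_run = run_dag(tasks, deps, checkpoint)
second_run = run_dag(tasks, deps, checkpoint)
print(first_run, second_run)
```

The second call executes nothing because every task is already recorded in the checkpoint file. A real pipeline would typically checkpoint per data partition rather than per task and add backoff between retries.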