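Data Analysis
High-Performance Multimodal Data Processing Pipeline Designer
Designs AI data pipelines that process images, audio, video, and structured data, suited to large-scale data engineering scenarios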
4/7/2026
You are a data engineering expert specializing in multimodal AI workloads. Help me design a high-performance data processing pipeline.
My use case: [describe your data types and scale, e.g., "processing 10M images + metadata daily for an e-commerce product search system"]
Design a pipeline that handles:
- Ingestion: Multi-source data collection (S3, APIs, streaming) with schema validation
- Processing: Parallel transformation of different modalities:
  - Images: resize, embedding generation (CLIP/SigLIP), deduplication
  - Text: chunking, embedding, entity extraction
  - Audio/Video: transcription, keyframe extraction, scene detection
  - Structured: normalization, feature engineering
- Storage: Optimal storage strategy (vector DB for embeddings, object store for raw, columnar for metadata)
- Orchestration: DAG-based workflow with retry, checkpointing, and incremental processing
- Monitoring: Data quality checks, pipeline health, drift detection
Please provide:
- Architecture diagram (text/mermaid format)
- Technology recommendations with justification (e.g., Daft vs Spark vs Ray for distributed processing)
- Sample pipeline code for the most complex modality
- Cost estimation framework
- Scaling strategy from prototype to production
Optimize for: throughput, cost-efficiency, and developer experience. Assume a small team (2-3 engineers).
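For reference, here is a minimal sketch of what I mean by retry, checkpointing, and incremental processing at the level of a single step. It is an illustration under stated assumptions, not a required design: it uses only the Python standard library, and the names `content_key`, `with_retry`, `run_incremental`, and the JSON checkpoint file are hypothetical. A real pipeline would likely delegate this behavior to the chosen orchestrator.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def content_key(payload: bytes) -> str:
    """Stable fingerprint of an item's bytes, used to skip unchanged inputs."""
    return hashlib.sha256(payload).hexdigest()


def with_retry(fn: Callable[[bytes], str], payload: bytes,
               attempts: int = 3, base_delay: float = 0.0) -> str:
    """Call fn(payload), retrying with exponential backoff on any exception."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn(payload)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))


def run_incremental(items: Iterable[Tuple[str, bytes]],
                    process: Callable[[bytes], str],
                    checkpoint_path: Path) -> Dict[str, str]:
    """Process only items whose content hash differs from the checkpoint.

    Returns the results produced in this run, keyed by item id.
    """
    done: Dict[str, str] = {}
    if checkpoint_path.exists():
        done = json.loads(checkpoint_path.read_text())

    fresh: Dict[str, str] = {}
    for item_id, payload in items:
        key = content_key(payload)
        if done.get(item_id) == key:
            continue  # unchanged since the last successful run
        fresh[item_id] = with_retry(process, payload)
        done[item_id] = key
        # Persist after every item so a crash loses at most one item of work.
        checkpoint_path.write_text(json.dumps(done))
    return fresh
```

In this sketch, a second run over the same inputs does no work, and only items whose bytes changed are reprocessed. Rewriting the checkpoint after every item is the simplest option and is assumed here for clarity; at the scale described above, batched or transactional checkpoints would probably be preferable.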