High-Performance Multimodal Data Processing Pipeline Designer

You are a data engineering expert specializing in multimodal AI workloads. Help me design a high-performance data processing pipeline.

My use case: [describe your data types and scale, e.g., "processing 10M images + metadata daily for an e-commerce product search system"]
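
For sizing, it helps to convert the daily volume into a sustained rate. Below is a minimal sketch using the example figure above. The function name `sustained_rate` and the 8-hour batch window are illustrative assumptions, not requirements:

```python
def sustained_rate(items_per_day: int, active_hours: float = 24.0) -> float:
    """Average items/second needed to clear a daily volume within the active window."""
    return items_per_day / (active_hours * 3600)


# 10M items spread evenly over a full day
print(round(sustained_rate(10_000_000), 1))      # 115.7
# The same volume squeezed into an assumed 8-hour batch window
print(round(sustained_rate(10_000_000, 8), 1))   # 347.2
```

These are averages; peak load and retries will push the required capacity higher.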

Design a pipeline that handles:

Ingestion: Multi-source data collection (S3, APIs, streaming) with schema validation
Processing: Parallel transformation of different modalities:
- Images: resize, embedding generation (CLIP/SigLIP), deduplication
- Text: chunking, embedding, entity extraction
- Audio/Video: transcription, keyframe extraction, scene detection
- Structured: normalization, feature engineering
Storage: Storage strategy matched to each data type (vector DB for embeddings, object store for raw files, columnar format for metadata)
Orchestration: DAG-based workflow with retry, checkpointing, and incremental processing
Monitoring: Data quality checks, pipeline health, drift detection
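
To illustrate what I mean by retry, checkpointing, and incremental processing, here is a rough standard-library sketch. It is not tied to any orchestrator; the names `content_key`, `with_retry`, and `run_incremental`, and the JSON checkpoint file, are hypothetical and only show the behavior I expect, not a required design:

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable, Iterable


def content_key(payload: bytes) -> str:
    """Stable ID derived from content, so reruns can skip finished items."""
    return hashlib.sha256(payload).hexdigest()


def with_retry(fn: Callable[[bytes], dict], payload: bytes,
               attempts: int = 3, base_delay: float = 1.0) -> dict:
    """Call fn, retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn(payload)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


def run_incremental(items: Iterable[bytes], fn: Callable[[bytes], dict],
                    checkpoint: Path, attempts: int = 3,
                    base_delay: float = 1.0) -> int:
    """Process only items not recorded in the checkpoint; return how many ran."""
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    processed = 0
    for payload in items:
        key = content_key(payload)
        if key in done:
            continue  # already handled in this or an earlier run
        with_retry(fn, payload, attempts, base_delay)
        done.add(key)
        # Sketch only: rewrites the whole file per item and is not atomic.
        checkpoint.write_text(json.dumps(sorted(done)))
        processed += 1
    return processed
```

A real pipeline would likely batch checkpoint writes and store them somewhere durable, but the skip-if-done and bounded-retry behavior is what I want preserved.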

Please provide:

Architecture diagram (text/mermaid format)
Technology recommendations with justification (e.g., Daft vs Spark vs Ray for distributed processing)
Sample pipeline code for the most complex modality
Cost estimation framework
Scaling strategy from prototype to production

Optimize for: throughput, cost-efficiency, and developer experience. Assume a small team (2-3 engineers).