PDF Document AI Data Extraction Pipeline Designer

You are an expert in document AI and data extraction pipelines. Design a complete PDF-to-structured-data pipeline for my use case.

Use case: [e.g., financial reports / research papers / invoices / contracts]
Volume: [e.g., 100 PDFs/day]
Output format: [Markdown / JSON with bounding boxes / database records]
Accuracy requirement: [e.g., 95%+ for tables, 99%+ for text]

Design the pipeline covering:

Pre-processing
- PDF classification (scanned vs native vs hybrid)
- Page segmentation and layout detection
- Quality assessment (DPI, skew, noise)
- Language detection
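
As an illustration of the kind of snippet expected for the classification step, here is a minimal sketch. It assumes some PDF library has already extracted the text layer and reported a character count per page. The function name `classify_pdf` and the `min_chars` threshold of 50 are hypothetical choices for illustration, not values taken from any tool.

```python
from typing import Sequence


def classify_pdf(page_char_counts: Sequence[int], min_chars: int = 50) -> str:
    """Label a PDF as 'native', 'scanned', or 'hybrid'.

    page_char_counts holds the number of characters found in each page's
    text layer. min_chars is an assumed threshold: pages below it are
    treated as image-only and would need OCR.
    """
    if not page_char_counts:
        raise ValueError("document has no pages")

    native_pages = sum(1 for n in page_char_counts if n >= min_chars)

    if native_pages == len(page_char_counts):
        return "native"   # every page has a usable text layer
    if native_pages == 0:
        return "scanned"  # no page has a usable text layer
    return "hybrid"       # mixed: route page by page
```

A hybrid result suggests choosing the extraction method per page rather than per document.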
Extraction Engine Selection
- For native text: direct extraction method
- For scanned pages: OCR engine selection
- For tables: table detection model + structure recognition
- For formulas: LaTeX conversion approach
- For charts: description generation via VLM
Post-processing
- Schema validation and type coercion
- Cross-reference resolution
- Confidence scoring per extracted field
- Human-in-the-loop routing for low-confidence items
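
The post-processing steps above could be sketched roughly as follows. The `Field` class, the `SCHEMA` mapping, its two field names, and the 0.9 threshold are all hypothetical; a real pipeline would take confidences from whatever the extraction engine reports.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Field:
    name: str
    raw: str           # text exactly as extracted
    confidence: float  # 0.0 to 1.0, assumed to come from the extractor


# Hypothetical schema: field name -> function that coerces the raw string.
SCHEMA: Dict[str, Callable[[str], Any]] = {
    "invoice_number": str,
    "total": lambda s: float(s.replace(",", "")),
}


def post_process(
    fields: List[Field], threshold: float = 0.9
) -> Tuple[Dict[str, Any], List[str]]:
    """Return (validated record, names of fields needing human review)."""
    record: Dict[str, Any] = {}
    review: List[str] = []
    for f in fields:
        caster = SCHEMA.get(f.name)
        if caster is None:
            review.append(f.name)  # not in schema: let a person decide
            continue
        try:
            value = caster(f.raw.strip())
        except ValueError:
            review.append(f.name)  # coercion failed: route to review
            continue
        if f.confidence < threshold:
            review.append(f.name)  # keep the value but flag it
        record[f.name] = value
    return record, review
```

Low-confidence values are kept in the record and flagged, so a reviewer sees the engine's best guess rather than an empty field.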
Output Formats
- Chunked Markdown optimized for RAG ingestion
- JSON with bounding boxes for source citation
- Structured records for database insertion
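
For the chunked Markdown output, one possible approach is to split on headings and cap chunk size, keeping the heading as metadata for each chunk. This is a simplified sketch: `chunk_markdown` and the 1000-character default are assumptions, and the function does not handle `#` lines inside fenced code blocks.

```python
from typing import Dict, List


def chunk_markdown(md: str, max_chars: int = 1000) -> List[Dict[str, str]]:
    """Split Markdown into chunks, each tagged with its nearest heading."""
    chunks: List[Dict[str, str]] = []
    heading = ""
    buf: List[str] = []

    def flush() -> None:
        # Emit the buffered lines as one chunk, skipping empty buffers.
        text = "\n".join(buf).strip()
        if text:
            chunks.append({"heading": heading, "text": text})
        buf.clear()

    for line in md.splitlines():
        if line.startswith("#"):
            flush()  # close the chunk that belongs to the previous heading
            heading = line.lstrip("#").strip()
            continue
        # Start a new chunk if adding this line would exceed the size cap.
        if buf and sum(len(b) + 1 for b in buf) + len(line) > max_chars:
            flush()
        buf.append(line)

    flush()
    return chunks
```

Carrying the heading with each chunk gives a retriever some context even when the chunk text alone is ambiguous.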
Tech Stack Recommendation
- Open-source tools comparison
- When to use hybrid mode (local + AI)
- Cost estimation per document
- Deployment architecture
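
The per-document cost estimate for hybrid mode can be sketched as a weighted sum of local and AI page costs. The function name and every price used with it are placeholders for illustration, not real vendor pricing.

```python
def cost_per_document(
    pages: int,
    ai_page_fraction: float,
    local_cost_per_page: float,
    ai_cost_per_page: float,
) -> float:
    """Estimate the cost of one document in hybrid (local + AI) mode.

    ai_page_fraction is the assumed share of pages routed to the AI
    backend (for example, pages with complex tables or charts).
    """
    if not 0.0 <= ai_page_fraction <= 1.0:
        raise ValueError("ai_page_fraction must be between 0 and 1")
    ai_pages = pages * ai_page_fraction
    local_pages = pages - ai_pages
    return local_pages * local_cost_per_page + ai_pages * ai_cost_per_page
```

Multiplying the result by the daily volume gives a rough daily budget to compare against a local-only or AI-only setup.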

Provide working code snippets for each stage using Python.