PromptForge
development, RAG, knowledge base, document processing, multimodal, data engineering

One-Click Conversion of Multimodal Documents into a Structured Knowledge Base

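Batch-convert PDFs, images, web pages, and other multimodal documents into a structured Markdown knowledge base that RAG systems can ingest directly.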

9 views, 4/23/2026

You are a document processing and knowledge engineering expert. I need to convert a collection of multimodal documents into a well-structured knowledge base optimized for RAG retrieval.

Input Documents

  • Document types: [PDF / images / web pages / slides / mixed]
  • Total volume: [number of documents]
  • Languages: [list languages]
  • Domain: [e.g., technical docs, research papers, business reports]

Requirements

Phase 1: Extraction & Conversion

For each document type, provide the optimal extraction pipeline:

  • PDFs → Markdown (preserve tables, formulas, images)
  • Images → OCR + structured text
  • Web pages → Clean Markdown (strip nav, ads, boilerplate)
  • Slides → Section-based Markdown with speaker notes
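
As a rough illustration of the web-page step above, the sketch below strips boilerplate subtrees and emits simple Markdown using only the Python standard library. The names `SKIP_TAGS`, `MarkdownExtractor`, and `html_to_markdown` are hypothetical, and the choice of which tags count as boilerplate is an assumption that would need tuning per site; a production pipeline would likely rely on a dedicated extraction tool instead.

```python
from html.parser import HTMLParser

# Assumption: everything inside these tags is navigation, ads, or other boilerplate.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form"}
# Markdown prefix for each block-level tag this sketch keeps.
PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}


class MarkdownExtractor(HTMLParser):
    """Collects headings, paragraphs, and list items outside boilerplate subtrees."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a boilerplate subtree
        self.current = None   # block tag currently being read, if any
        self.buffer = []      # text fragments of the current block
        self.blocks = []      # finished Markdown blocks

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in PREFIX:
            self.current = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag == self.current:
            # Collapse internal whitespace before emitting the block.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(PREFIX[tag] + text)
            self.current = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.current is not None:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

Inline tags such as `<b>` are flattened to plain text here, and tables, images, and links are ignored, so this only covers the "strip nav, ads, boilerplate" part of the requirement.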

Phase 2: Structuring

  • Create a unified taxonomy/tagging system
  • Generate document-level metadata (title, author, date, topics, summary)
  • Split into semantic chunks (not arbitrary token splits)
  • Create cross-references between related chunks
  • Generate a knowledge graph of key entities and relationships
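
One possible reading of "semantic chunks" is to split along the document's own heading structure rather than at fixed token counts. The sketch below assumes the input is already Markdown and that each chunk should carry its heading path as metadata; `chunk_by_headings` and the dictionary keys are hypothetical names chosen for illustration.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str, doc_title: str) -> list[dict]:
    """Split Markdown into one chunk per section, tagged with its heading path."""
    chunks = []
    path = []    # stack of (level, heading text) leading to the current section
    lines = []   # body lines collected for the current section

    def flush():
        text = "\n".join(lines).strip()
        if text:
            chunks.append({
                "doc": doc_title,
                "section": " > ".join(title for _, title in path),
                "text": text,
            })
        lines.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any sections at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            lines.append(line)
    flush()
    return chunks
```

The heading path doubles as lightweight metadata for filtering and cross-referencing. Note that this sketch does not special-case fenced code blocks, so a line starting with `#` inside a code fence would be treated as a heading.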

Phase 3: RAG Optimization

  • Recommend chunk sizes and overlap for my use case
  • Create hypothetical questions for each chunk (index-time question generation, which complements query-time HyDE retrieval)
  • Generate embedding-friendly summaries
  • Design a hybrid search strategy (semantic + keyword + metadata filters)
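
To make the hybrid search item concrete, here is a minimal sketch that merges a semantic ranking and a keyword ranking with reciprocal rank fusion (RRF) and then applies exact-match metadata filters. RRF is one common fusion choice, not something the prompt itself prescribes; the function names, the `k = 60` constant, and the shape of the `metadata` dictionary are assumptions made for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def hybrid_search(
    semantic: list[str],
    keyword: list[str],
    metadata: dict[str, dict],
    filters: dict[str, str],
    k: int = 60,
) -> list[str]:
    """Fuse the two rankings, then keep chunks whose metadata matches every filter."""
    fused = reciprocal_rank_fusion([semantic, keyword], k)
    return [
        cid for cid in fused
        if all(metadata.get(cid, {}).get(field) == value
               for field, value in filters.items())
    ]
```

Because RRF only uses rank positions, it avoids having to normalize embedding similarities against keyword scores such as BM25, which live on different scales. Filtering after fusion keeps the sketch short; in practice filters are often pushed down into each retriever.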

Phase 4: Quality Assurance

  • Provide a QA checklist for converted documents
  • Provide sample queries to test retrieval quality
  • Define metrics to monitor knowledge base health over time
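
As one example of what the sample queries and health metrics could feed into, the sketch below computes recall@k and mean reciprocal rank (MRR) over a set of test queries. These two metrics are an assumption about what "retrieval quality" might mean here; the function names and the `(retrieved, relevant)` input format are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average both metrics over (retrieved IDs, relevant IDs) pairs, one per query."""
    n = len(runs)
    if n == 0:
        return {f"recall@{k}": 0.0, "mrr": 0.0}
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }
```

Rerunning the same labeled query set after each re-indexing gives a simple trend line for knowledge base health.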

Output: Step-by-step pipeline with tool recommendations, sample configs, and automation scripts.