All walkthroughs
rag-multimodal

Multi-modal RAG

PDFs are split into text and extracted images. A vision model captions every image, and text and visuals are embedded into two parallel vector stores. Queries retrieve from both, so a chart, diagram, or photo can ground the answer just like a paragraph can.

1
2
3
Interactive Architecture diagram
DUAL-STORE INGESTIONMULTIMODAL RETRIEVAL & SYNTHESISExtract textExtract imagesEmbedCaptionEmbedText hitsImage hitsSearchAugmentCite
Diagnostics Dashboard

Stage-by-Stage Data Flow Explorer

Select a phase from the controller below, then click individual step nodes to view their technical role, inputs, outputs, and mockup diagnostics data stream.

Phase Summary:

Ingestion

Text and images are extracted, captioned, and embedded into dual stores.

Click a Node to Inspect:
[ROLE]:Parses PDF files page-by-page to extract plain text segments.
[TECH STACK]:pypdf / PDFPlumber
[INPUT]:PDF file stream
[OUTPUT]:Plain text page strings
[RAW DATA STREAM]:
> INGESTING PDF...
> EXTRACED: 4 text pages

Best suited for

Documents where the answer can live in a chart, diagram, table, or photo — not just prose.

Corpus
PDFs with visuals
Queries
Visual + textual
Infra
Vision model + 2 stores
Latency
Higher (vision)

Complexity

High

Image extraction, a vision model captioning every figure, and two parallel stores sit on top of the standard retrieval flow.

Relevance today

Increasingly important as real documents are visual — reports, papers, manuals — and capable vision LLMs make it practical.

Where it's used

Technical & scientific PDFs

Answer from figures, plots, and diagrams, not only the surrounding text.

Financial reports

Read charts and tables directly to ground numerical answers.

Manuals & catalogs

Ground responses in schematics, product images, and labelled diagrams.

Why it matters

  • Retrieves from both text and visual content in a single unified pipeline.
  • Vision captioning makes charts, diagrams, and photos searchable as text.
  • Citations reference the exact page and figure behind each claim.

Trade-offs & considerations

  • Requires a vision-capable model (GPT-4o) — higher cost per image at ingest time.
  • Two parallel stores to manage; ingestion is noticeably slower.
  • Caption quality bounds image-retrieval quality — vague captions hurt recall.