3 phases

rag-multimodal

Multi-modal RAG

PDFs are split into text and extracted images. A vision model captions every image, and text and visuals are embedded into two parallel vector stores. Queries retrieve from both, so a chart, diagram, or photo can ground the answer just like a paragraph can.
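The dual-store flow described above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: `Record`, `VectorStore`, `embed`, and `retrieve` are hypothetical names, and the bag-of-words `embed` is a toy stand-in for a real embedding model. It also assumes both stores share one embedding space, so scores can be compared directly when the hits are merged.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Record:
    content: str  # page text, or the caption generated for an image
    page: int
    kind: str     # "text" or "image"


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class VectorStore:
    """In-memory store; a real deployment would use a vector database."""

    def __init__(self) -> None:
        self._items: list[tuple[Counter, Record]] = []

    def add(self, record: Record) -> None:
        self._items.append((embed(record.content), record))

    def search(self, query: str, k: int = 2) -> list[tuple[float, Record]]:
        q = embed(query)
        scored = [(cosine(q, vec), rec) for vec, rec in self._items]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def retrieve(query: str, text_store: VectorStore, image_store: VectorStore, k: int = 2):
    # Query both stores, then merge the hits by similarity score.
    hits = text_store.search(query, k) + image_store.search(query, k)
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


text_store, image_store = VectorStore(), VectorStore()
text_store.add(Record("Revenue grew in the third quarter due to new subscriptions.", 3, "text"))
text_store.add(Record("The appendix lists office locations.", 9, "text"))
# Image records are indexed by their generated caption.
image_store.add(Record("Bar chart of quarterly revenue, with the third quarter highest.", 4, "image"))

hits = retrieve("quarterly revenue chart", text_store, image_store)
for score, rec in hits:
    print(f"{score:.2f}  {rec.kind:5s}  page {rec.page}")
```

In this toy data the chart caption ranks first for a chart-oriented query, which is the behaviour the two-store design is meant to enable.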


Ingestion

Retrieval

Generation


Phase Summary:

Ingestion

Text and images are extracted, captioned, and embedded into dual stores.

[ROLE]:Parses PDF files page-by-page to extract plain text segments.

[TECH STACK]:pypdf / pdfplumber

[INPUT]:PDF file stream

[OUTPUT]:Plain text page strings

[RAW DATA STREAM]:

> INGESTING PDF...
> EXTRACTED: 4 text pages
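The text-extraction node described above might look roughly like the following, assuming pypdf is installed. The function names and the record layout (`source`, `page`, `text`) are illustrative assumptions, not the pipeline's actual schema.

```python
def extract_page_texts(pdf_path: str) -> list[str]:
    """Return one plain-text string per page of the PDF."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    # Scanned or image-only pages may yield no text at all.
    return [page.extract_text() or "" for page in reader.pages]


def to_page_records(page_texts: list[str], source: str) -> list[dict]:
    """Attach 1-based page numbers so later stages can cite the page."""
    return [
        {"source": source, "page": number, "text": text.strip()}
        for number, text in enumerate(page_texts, start=1)
        if text.strip()  # skip pages with no extractable text
    ]


# Usage with already-extracted strings (no PDF needed for this part):
records = to_page_records(["Intro text\n", "   ", "Results"], "report.pdf")
print(records)
```

Keeping the page number on every record at this stage is what allows page-level citations later, since the number is hard to recover once the text has been chunked.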

Best suited for

Documents where the answer can live in a chart, diagram, table, or photo — not just prose.

Corpus

PDFs with visuals

Queries

Visual + textual

Infra

Vision model + 2 stores

Latency

Higher (vision)

Complexity

High

Image extraction, a vision model captioning every figure, and two parallel stores sit on top of the standard retrieval flow.

Relevance today

Increasingly important as real documents are visual — reports, papers, manuals — and capable vision LLMs make it practical.

Where it's used

Technical & scientific PDFs

Answer from figures, plots, and diagrams, not only the surrounding text.

Financial reports

Read charts and tables directly to ground numerical answers.

Manuals & catalogs

Ground responses in schematics, product images, and labelled diagrams.

Why it matters

Retrieves from both text and visual content in a single unified pipeline.
Vision captioning makes charts, diagrams, and photos searchable as text.
Citations reference the exact page and figure behind each claim.
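One way to support page-and-figure citations is to carry that metadata on every retrieved item and tag it in the generation prompt. The sketch below is an assumption about how this could be done; `Evidence`, `citation`, and `build_context` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    content: str                  # text chunk, or caption for an image
    source: str                   # file name of the PDF
    page: int                     # 1-based page number
    figure: Optional[str] = None  # figure label, set only for image hits


def citation(e: Evidence) -> str:
    parts = [e.source, f"p. {e.page}"]
    if e.figure:
        parts.append(e.figure)
    return ", ".join(parts)


def build_context(evidence: list[Evidence]) -> str:
    # Number each item so the model can cite it as [1], [2], ...
    return "\n".join(
        f"[{i}] ({citation(e)}) {e.content}"
        for i, e in enumerate(evidence, start=1)
    )


context = build_context([
    Evidence("Revenue grew in the third quarter.", "report.pdf", 3),
    Evidence("Bar chart of quarterly revenue.", "report.pdf", 4, "Figure 2"),
])
print(context)
```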

Trade-offs & considerations

Requires a vision-capable model (for example, GPT-4o), which adds a per-image cost at ingest time.
Two parallel stores to manage; ingestion is noticeably slower.
Caption quality bounds image-retrieval quality — vague captions hurt recall.
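Because captions are what the image store actually indexes, it can help to ask the vision model for specifics and to flag suspiciously short captions before embedding. The following is a sketch under stated assumptions: `describe` is a hypothetical callable wrapping whichever vision model is used, the prompt wording is illustrative, and the eight-word threshold is an arbitrary placeholder to tune.

```python
from typing import Callable

# Illustrative prompt; the wording is an assumption, not taken from the pipeline.
CAPTION_PROMPT = (
    "Describe this figure for search indexing. State the figure type, "
    "axis labels and units, named series or components, and any legible "
    "values or trends."
)

MIN_CAPTION_WORDS = 8  # arbitrary threshold for flagging vague captions


def caption_images(
    images: list[dict],
    describe: Callable[[bytes, str], str],
) -> list[dict]:
    """Caption each image and mark very short captions for review."""
    records = []
    for image in images:
        caption = describe(image["data"], CAPTION_PROMPT).strip()
        records.append({
            "page": image["page"],
            "caption": caption,
            "needs_review": len(caption.split()) < MIN_CAPTION_WORDS,
        })
    return records


def fake_describe(data: bytes, prompt: str) -> str:
    # Stand-in for a real vision-model call, used only for this demo.
    if data == b"a":
        return "A chart."
    return "Line chart of monthly active users from January to June, rising steadily."


captions = caption_images(
    [{"page": 2, "data": b"a"}, {"page": 5, "data": b"b"}],
    fake_describe,
)
print(captions)
```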

Alternatives to consider

When Multi-modal RAG isn't the right fit, reach for one of these instead.

rag-standard: Standard RAG

If your documents are text-only, the vision pipeline is unnecessary overhead.

rag-graph: GraphRAG

When structure and relationships matter more than the visuals themselves.

More architectures

Explore the other pipelines

rag-standard: Standard RAG. Hybrid vector + BM25 retrieval. The production baseline.

rag-graph: GraphRAG. Knowledge-graph retrieval for multi-hop reasoning.

rag-agentic: Agentic RAG. A ReAct agent decides when, and how, to retrieve.

rag-vectorless: PageIndex RAG. Zero vector DB. BM25 + pickle persistence.

rag-wiki: LLM Wiki. Compounding knowledge: the corpus becomes a wiki.