Multi-modal RAG
PDFs are split into text and extracted images. A vision model captions every image, and text and visuals are embedded into two parallel vector stores. Queries retrieve from both, so a chart, diagram, or photo can ground the answer just like a paragraph can.
Stage-by-Stage Data Flow Explorer
Ingestion
Text and images are extracted, captioned, and embedded into dual stores.
> INGESTING PDF...
> EXTRACTED: 4 text pages
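The ingestion phase can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: `embed` is a toy bag-of-words stand-in for a real embedding model, `caption_image` is a hypothetical placeholder for the vision-model call, and the `Record` fields and sample data are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption); a real system calls an embedding model."""
    return Counter(text.lower().split())


@dataclass
class Record:
    page: int
    content: str                  # page text, or the caption for an image
    vector: Counter
    figure: Optional[str] = None  # hypothetical figure id, set for images only


def caption_image(image_bytes: bytes, hint: str) -> str:
    """Hypothetical stand-in for a vision-model call; here it just echoes the hint."""
    return hint


def ingest(pages, images):
    """Build two parallel stores: one for page text, one for image captions."""
    text_store, image_store = [], []
    for page_no, text in pages:
        text_store.append(Record(page_no, text, embed(text)))
    for page_no, figure_id, image_bytes, hint in images:
        caption = caption_image(image_bytes, hint)
        image_store.append(Record(page_no, caption, embed(caption), figure_id))
    return text_store, image_store


# Made-up sample input: two text pages and one extracted image.
pages = [(1, "Revenue grew in 2023"), (2, "Costs were flat")]
images = [(2, "fig-1", b"\x89PNG", "bar chart of quarterly revenue")]
text_store, image_store = ingest(pages, images)
```

Keeping the page number and figure id on every record at ingest time is what later allows an answer to point back to its source.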
Best suited for
Documents where the answer can live in a chart, diagram, table, or photo — not just prose.
Complexity
Image extraction, a vision model captioning every figure, and two parallel stores sit on top of the standard retrieval flow.
Relevance today
Increasingly important: real documents such as reports, papers, and manuals are heavily visual, and capable vision LLMs now make this approach practical.
Where it's used
Technical & scientific PDFs
Answer from figures, plots, and diagrams, not only the surrounding text.
Financial reports
Read charts and tables directly to ground numerical answers.
Manuals & catalogs
Ground responses in schematics, product images, and labelled diagrams.
Why it matters
- Retrieves from both text and visual content in a single pipeline.
- Vision captioning makes charts, diagrams, and photos searchable as text.
- Citations reference the exact page and figure behind each claim.
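A query against both stores, with page and figure citations, might look something like the sketch below. It assumes a toy bag-of-words `embed`, plain cosine similarity, and a simple merge of both stores by score. The store contents, field names, and `cite` format are all hypothetical, and a production system may well rerank or keep a quota per store instead.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption); swap in a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical store contents; image records hold the vision model's caption.
text_store = [
    {"page": 3, "figure": None, "content": "Operating costs were flat year over year"},
    {"page": 4, "figure": None, "content": "Revenue grew across all regions"},
]
image_store = [
    {"page": 4, "figure": "Figure 2", "content": "Bar chart of quarterly revenue by region"},
]


def retrieve(query, k=2):
    """Score every record in both stores and keep the k best overall."""
    q = embed(query)
    hits = []
    for source, store in (("text", text_store), ("image", image_store)):
        for rec in store:
            hits.append((cosine(q, embed(rec["content"])), source, rec))
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:k]


def cite(rec):
    """Format a citation from the page and, for images, the figure label."""
    if rec["figure"]:
        return f"p. {rec['page']}, {rec['figure']}"
    return f"p. {rec['page']}"


results = retrieve("quarterly revenue by region")
citations = [cite(rec) for _, _, rec in results]
```

In this toy data the chart caption outranks the surrounding prose, so the first citation names both the page and the figure.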
Trade-offs & considerations
- Requires a vision-capable model (for example, GPT-4o), which adds a per-image cost at ingest time.
- Two parallel stores to manage, and ingestion is slower because every extracted image needs a captioning call.
- Caption quality bounds image-retrieval quality — vague captions hurt recall.
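The effect of caption quality on retrieval can be illustrated with a small sketch. It assumes a toy bag-of-words `embed` and cosine similarity, and both captions are made up for the example. A real embedding model captures more meaning than word overlap, so this shows the direction of the effect rather than its size.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption), used only to illustrate the point."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


query = "quarterly revenue by region"

# Two hypothetical captions for the same image.
vague = "A chart with several bars"
specific = "Bar chart of quarterly revenue by region for 2023"

vague_score = cosine(embed(query), embed(vague))
specific_score = cosine(embed(query), embed(specific))
```

The vague caption shares no terms with the query, so the image would not surface for it, while the specific caption scores well. Prompting the vision model for axis labels, units, and entities is one way to push captions toward the second kind.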
Alternatives to consider
When Multi-modal RAG isn't the right fit, reach for one of these instead.