3 phases

rag-vectorless

PageIndex RAG

No embeddings, no vector database, no infrastructure. Documents are tokenized into a BM25Okapi index persisted as a pickle file. Retrieval is pure lexical scoring — fast, deterministic, and runnable on a laptop.
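The index-and-persist idea above can be sketched with the standard library alone. This is an illustrative from-scratch Okapi BM25, not the BM25Okapi class the pipeline uses; the function names, the tokenizer, the `k1`/`b` defaults, the smoothed IDF variant, the file name `bm25_index.pkl`, and the toy documents are all assumptions made for the example.

```python
import math
import pickle
import re
import tempfile
from collections import Counter
from pathlib import Path


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (an assumed scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, k1=1.5, b=0.75):
    """Collect the term statistics BM25 needs into plain, picklable containers."""
    term_freqs = [dict(Counter(tokenize(d))) for d in docs]
    doc_lens = [sum(tf.values()) for tf in term_freqs]
    doc_freq = Counter(term for tf in term_freqs for term in tf)
    n = len(docs)
    # Smoothed IDF variant that stays positive for very common terms.
    idf = {t: math.log(1 + (n - df + 0.5) / (df + 0.5)) for t, df in doc_freq.items()}
    return {"docs": docs, "term_freqs": term_freqs, "doc_lens": doc_lens,
            "avg_len": sum(doc_lens) / n, "idf": idf, "k1": k1, "b": b}


def search(index, query, k=3):
    """Return up to k documents with a positive BM25 score, best first."""
    k1, b = index["k1"], index["b"]
    scored = []
    for doc, tf, length in zip(index["docs"], index["term_freqs"], index["doc_lens"]):
        norm = k1 * (1 - b + b * length / index["avg_len"])
        score = sum(
            index["idf"][t] * tf[t] * (k1 + 1) / (tf[t] + norm)
            for t in tokenize(query) if t in tf
        )
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


# Toy documents, invented for illustration.
docs = [
    "EBITDA grew 18% to $4.6M in 2026.",
    "Product lines expanded into new regions.",
    "The board approved a dividend increase.",
]

with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "bm25_index.pkl"
    with index_path.open("wb") as fh:
        pickle.dump(build_index(docs), fh)
    # Only unpickle files you created yourself; pickle can run code on load.
    with index_path.open("rb") as fh:
        restored = pickle.load(fh)

top = search(restored, "EBITDA growth 2026")
```

Storing plain dicts and lists rather than a custom object keeps the pickle loadable from any script, without needing the class definition on the import path.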

Indexing

Retrieval

Generation

Stage-by-Stage Data Flow Explorer

Phase Summary:

Indexing

Chunks are tokenized and their term statistics are stored in a classic BM25 index, saved as a single file.
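For reference, one commonly cited form of the Okapi BM25 score for a query Q against a chunk D is shown below. The symbols are not taken from this page: f(q_i, D) is the count of term q_i in D, |D| is the chunk length in tokens, avgdl is the average chunk length, N is the number of chunks, n(q_i) is the number of chunks containing q_i, and k_1 and b are tuning parameters. The exact IDF variant and the parameter defaults differ between implementations.

```latex
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(q_i) = \ln\!\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
```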

Step detail: document ingestion

[ROLE]: Reads source documents from files, repositories, or URLs, parses the binary content, and decodes it into clean UTF-8 text passages.

[TECH STACK]: PyPDF2 / docx2txt / LangChain WebBaseLoader / pdfplumber

[INPUT]: Raw binary data stream (PDF, DOCX, TXT, HTML, JSON)

[OUTPUT]: Normalized plaintext string of the document content, with structure metadata

[RAW DATA STREAM] (mock data):

> INGEST_STREAM: "financial_report_2026.pdf" (Size: 2.4 MB)
> DECODING_META: { mime: "application/pdf", pages: 12 }
> READOUT: "Ragiment Corp Annual Report 2026. EBITDA grew 18% to $4.6M. Product lines expanded by..."

Best suited for

Small corpora and zero-infrastructure deployments — runnable entirely on a laptop.

Corpus: < ~10K docs

Queries: Keyword lookup

Infra: None (local file)

Latency: Very low

Complexity: Minimal

No database, no embeddings, no GPU — just a BM25 index serialized to a file. The simplest RAG you can ship.

Relevance today

A classic information-retrieval technique that's still genuinely useful for prototypes, edge/offline apps, and exact-keyword search.

Where it's used

Prototypes & demos

Stand up a working RAG pipeline in minutes with no database to provision.

Edge & offline apps

Run fully local — no network, no GPU, no external dependencies.

Compliance-restricted setups

No data leaves the machine; the index is a single file on disk.

Why it matters

Zero infrastructure — no vector database, no embeddings, no GPU.
Deterministic and fast: the same query over the same index always returns the same ranking, and the BM25 index serializes to one pickle file.
Excellent at exact-keyword, code, and identifier search.
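Exact-identifier search depends on the tokenizer keeping identifiers whole. The sketch below shows one possible tokenizer and a strict all-terms-present filter; the regular expression, the function names, and the sample strings are assumptions for illustration and are not taken from the pipeline.

```python
import re

# Keep snake_case and alphanumeric identifiers as single tokens.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+")


def tokenize_code(text):
    """Split text into lowercase identifier-like tokens."""
    return [tok.lower() for tok in IDENT.findall(text)]


def exact_hits(query, docs):
    """Return the documents that contain every query token."""
    wanted = set(tokenize_code(query))
    return [d for d in docs if wanted <= set(tokenize_code(d))]


# Sample strings invented for illustration.
docs = [
    "Call get_user_id() before refresh_token().",
    "Users get an id after sign-up.",
    "ERR_4021 is raised when get_user_id() times out.",
]

by_function = exact_hits("get_user_id", docs)
by_error_code = exact_hits("ERR_4021", docs)
```

Because `get_user_id` stays a single token, the second document, which only contains the separate words "get" and "id", is not matched.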

Trade-offs & considerations

Purely lexical — it misses paraphrases and semantic matches a vector search would catch.
Doesn't scale to very large corpora as gracefully as vector retrieval; the whole index is loaded into memory from the pickle file.
Relevance comes from term frequency, term rarity, and document length; there is no semantic ranking.
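The paraphrase limitation follows from how lexical scoring works: only query terms that literally occur in a passage contribute to its score. A small sketch, assuming a simple lowercase alphanumeric tokenizer and invented sample strings:

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens as a set (an assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shared_terms(query, passage):
    """Terms that could contribute to a lexical score for this pair."""
    return tokens(query) & tokens(passage)


passage = "EBITDA grew 18% to $4.6M."

literal = shared_terms("EBITDA 18%", passage)
paraphrase = shared_terms("how much did earnings increase", passage)
```

The literal query shares two terms with the passage, while the paraphrase shares none, so a purely lexical ranker has nothing to score it on even though a human reader would consider the passage relevant.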

Alternatives to consider

When PageIndex RAG isn't the right fit, reach for one of these instead.

Standard RAG (rag-standard)

When you need semantic matching and the corpus grows past a few thousand documents.

LLM Wiki (rag-wiki)

When the corpus is static and you want higher answer quality from synthesized topics.

More architectures

Explore the other pipelines

Standard RAG (rag-standard)

Hybrid vector + BM25 retrieval. The production baseline.

GraphRAG (rag-graph)

Knowledge-graph retrieval for multi-hop reasoning.

Agentic RAG (rag-agentic)

A ReAct agent decides when, and how, to retrieve.

LLM Wiki (rag-wiki)

Compounding knowledge: the corpus becomes a wiki.

Multi-modal RAG (rag-multimodal)

Text + images retrieved together. Vision-grounded answers.