
PageIndex RAG

No embeddings, no vector database, no infrastructure. Documents are tokenized into a BM25Okapi index persisted as a pickle file. Retrieval is pure lexical scoring — fast, deterministic, and runnable on a laptop.
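As a rough sketch of what "tokenize, score, persist" can look like, the block below builds Okapi BM25 statistics with only the standard library and round-trips them through a pickle file. It is a stand-in for illustration, not the BM25Okapi class named above: the function names `tokenize`, `build_index`, and `bm25_scores`, the `K1`/`B` values, the sample chunks, and the file path are all assumptions.

```python
import math
import pickle
import re
import tempfile
from collections import Counter
from pathlib import Path

K1, B = 1.5, 0.75  # common BM25 defaults; assumed, not taken from the walkthrough


def tokenize(text):
    """Lowercase and keep alphanumeric runs (an assumed, deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_index(chunks):
    """Collect the per-chunk statistics BM25 needs into a plain, picklable dict."""
    term_freqs = [Counter(tokenize(c)) for c in chunks]
    doc_lens = [sum(tf.values()) for tf in term_freqs]
    doc_freq = Counter(term for tf in term_freqs for term in tf)
    n = len(chunks)
    # Smoothed IDF variant that stays positive even for very common terms.
    idf = {t: math.log(1 + (n - df + 0.5) / (df + 0.5)) for t, df in doc_freq.items()}
    return {"chunks": chunks, "term_freqs": term_freqs, "doc_lens": doc_lens,
            "avgdl": sum(doc_lens) / n, "idf": idf}


def bm25_scores(index, query):
    """Return one BM25 score per chunk, in chunk order."""
    q_terms = tokenize(query)
    scores = []
    for tf, dl in zip(index["term_freqs"], index["doc_lens"]):
        norm = K1 * (1 - B + B * dl / index["avgdl"])  # length normalization
        scores.append(sum(index["idf"].get(t, 0.0) * tf[t] * (K1 + 1) / (tf[t] + norm)
                          for t in q_terms))
    return scores


chunks = [
    "EBITDA grew 18% to $4.6M in fiscal 2026.",
    "Product lines expanded into two new regions.",
    "The board approved a share buyback program.",
]

# Build once and persist to a single file (hypothetical location).
index_path = Path(tempfile.gettempdir()) / "bm25_index.pkl"
index_path.write_bytes(pickle.dumps(build_index(chunks)))

# At query time: load the file, score every chunk, pick the best one.
index = pickle.loads(index_path.read_bytes())
scores = bm25_scores(index, "EBITDA growth")
best = max(range(len(scores)), key=scores.__getitem__)
print(index["chunks"][best])
```

Only the first chunk shares a token with the query ("ebitda"), so it is the only one with a non-zero score. Unpickling can execute arbitrary code, so a setup like this should only load index files it created itself.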

[Architecture diagram. Index (built once): Tokenize, Score, Persist. Local lexical query engine: Load index, Search, Rank, Generate.]
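The Rank and Generate steps on the query side could be sketched as follows. The scores here are made-up numbers standing in for a BM25 scorer's output, and `top_k`, `build_prompt`, and the prompt wording are hypothetical. No model call is shown, since the walkthrough does not name one.

```python
def top_k(chunks, scores, k=3):
    """Sort chunks by score (highest first) and keep up to k with a non-zero score."""
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [(chunk, score) for score, chunk in ranked[:k] if score > 0]


def build_prompt(question, hits):
    """Assemble a grounded prompt from the retrieved chunks (hypothetical template)."""
    context = "\n".join(f"[{i}] {chunk}" for i, (chunk, _) in enumerate(hits, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


chunks = ["EBITDA grew 18% to $4.6M.", "Product lines expanded.", "Offices relocated."]
scores = [2.1, 0.0, 0.4]  # illustrative values, not real BM25 output

hits = top_k(chunks, scores, k=2)
prompt = build_prompt("How much did EBITDA grow?", hits)
print(prompt)
```

Dropping zero-score chunks keeps text that shares no term with the query out of the prompt, even when fewer than `k` chunks match.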
Diagnostics Dashboard

Stage-by-Stage Data Flow Explorer


Phase Summary:

Indexing

Chunks are tokenized and their term statistics are stored in a classic BM25 index, saved as a single file.
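A minimal sketch of the chunk-then-tokenize step, assuming fixed-size word windows with overlap. The walkthrough does not say how chunks are cut, so `chunk_words`, the window sizes, and the tokenizer regex are all illustrative choices.

```python
import re


def chunk_words(text, size=40, overlap=10):
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def tokenize(chunk):
    """Lowercase and keep alphanumeric runs (assumed tokenizer)."""
    return re.findall(r"[a-z0-9_]+", chunk.lower())


text = "Ragiment Corp Annual Report 2026. EBITDA grew 18% to $4.6M."
chunks = chunk_words(text, size=6, overlap=2)
tokenized = [tokenize(c) for c in chunks]
print(chunks)
print(tokenized)
```

The tokenized lists are what a BM25 index would count terms over. The overlap is there so that a phrase straddling a chunk boundary still appears whole in at least one chunk.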

Node details:
[ROLE]: Reads source documents from files, repositories, or URLs, parses the binary content, and converts it into clean UTF-8 text passages.
[TECH STACK]: PyPDF2 / Docx2txt / LangChain WebBaseLoader / PDFPlumber
[INPUT]: Raw binary data stream (PDF, DOCX, TXT, HTML, JSON)
[OUTPUT]: Normalized plaintext string of the document content, with structure metadata
[RAW DATA STREAM]:
> INGEST_STREAM: "financial_report_2026.pdf" (Size: 2.4 MB)
> DECODING_META: { mime: "application/pdf", pages: 12 }
> READOUT: "Ragiment Corp Annual Report 2026. EBITDA grew 18% to $4.6M. Product lines expanded by..."
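For plain-text inputs, the normalization this node describes might look like the sketch below. It covers only bytes that are already text. PDF and DOCX parsing would go through the libraries listed in the tech stack and is not shown. `load_plaintext`, the returned fields, and the sample file name are hypothetical.

```python
import re
import unicodedata


def load_plaintext(raw, source):
    """Decode raw bytes to clean text and attach simple metadata."""
    text = raw.decode("utf-8", errors="replace")    # undecodable bytes become U+FFFD
    text = unicodedata.normalize("NFKC", text)      # e.g. non-breaking space -> plain space
    text = re.sub(r"[ \t]+", " ", text)             # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text).strip()  # keep paragraph breaks, drop extra blank lines
    return {"source": source, "text": text, "chars": len(text)}


raw = "Ragiment Corp   Annual Report 2026.\n\n\n\nEBITDA grew 18%\u00a0to $4.6M.".encode("utf-8")
doc = load_plaintext(raw, "financial_report_2026.txt")
print(doc["text"])
```

Normalizing before chunking means the tokenizer sees the same characters whichever loader produced the text.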

Best suited for

Small corpora and zero-infrastructure deployments — runnable entirely on a laptop.

Corpus: < ~10K docs
Queries: Keyword lookup
Infra: None (local file)
Latency: Very low

Complexity

Minimal

No database, no embeddings, no GPU — just a BM25 index serialized to a file. The simplest RAG you can ship.

Relevance today

A classic information-retrieval technique that's still genuinely useful for prototypes, edge/offline apps, and exact-keyword search.

Where it's used

Prototypes & demos

Stand up a working RAG pipeline in minutes with no database to provision.

Edge & offline apps

Run fully local — no network, no GPU, no external dependencies.

Compliance-restricted setups

No data leaves the machine; the index is a single file on disk.

Why it matters

  • Zero infrastructure — no vector database, no embeddings, no GPU.
  • Deterministic and fast: the same query against the same index always returns the same ranking, and the BM25 index serializes to one pickle file.
  • Excellent at exact-keyword, code, and identifier search.

Trade-offs & considerations

  • Purely lexical — it misses paraphrases and semantic matches a vector search would catch.
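  • Doesn't scale to very large corpora as gracefully as vector retrieval; the whole pickled index has to be loaded into memory before the first query.
  • Relevance is driven by term frequency, term rarity, and chunk length; there is no semantic ranking.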
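The lexical blind spot is easy to see with token overlap alone. BM25 sums contributions only over query terms that occur in a chunk, so a query with no shared tokens scores zero no matter how close its meaning is. The tokenizer and sample strings below are illustrative assumptions.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens as a set (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


chunk = "EBITDA grew 18% to $4.6M."
exact_query = "EBITDA 2026"
paraphrase = "How much did earnings increase?"

shared_exact = tokens(chunk) & tokens(exact_query)
shared_paraphrase = tokens(chunk) & tokens(paraphrase)

print(shared_exact)       # {'ebitda'}
print(shared_paraphrase)  # set()
```

The paraphrased question asks about the same fact, but it has nothing in common with the chunk at the token level, so a purely lexical retriever would never surface it.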