Pipeline walkthroughs

How each pipeline actually works.

Six architectures, visualized stage by stage. Follow the data from raw documents through retrieval to a grounded answer — then generate the code.

6 architectures

18 total phases

58 pipeline steps

[SYS_OK: RAG-STANDARD]

COMPLEXITY: █░░░ [EASY]

Standard RAG

Hybrid vector + BM25 retrieval. The production baseline.

> STAGE:Ingestion

> RECAP:Load documents, split them into overlapping chunks, embed each chunk, and store the vectors.

> PATHS:

Document Loader: Reads raw files in any supported format and normalizes them into clean plain text.
➔ Text Splitter: Splits each document into overlapping ~512-token chunks so context isn't lost at boundaries.
➔ Embedder: Turns every chunk into a dense vector that captures its meaning for similarity search.
➔ Vector Store: Persists the vectors (with their text) in a vector database for fast nearest-neighbour lookup.

> BLUEPRINT:Fuses dense vector retrieval with BM25 keyword matching so that both paraphrased and exact-term matches are recalled.

[PHASES: 3 | STEPS: 11]
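
The Text Splitter step above can be sketched in a few lines. This is a minimal illustration, not the walkthrough's generated code: it assumes whitespace-separated words stand in for tokens, and the function name `split_into_chunks` and the 64-token overlap are hypothetical choices.

```python
def split_into_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping chunks of whitespace-delimited tokens.

    Assumption: real pipelines usually count model tokens; plain words are
    used here to keep the sketch dependency-free.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks
```

Each chunk repeats the last `overlap` tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.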

[SYS_OK: RAG-GRAPH]

COMPLEXITY: ███░ [HIGH]

GraphRAG

Knowledge-graph retrieval for multi-hop reasoning.

> STAGE:Indexing

> RECAP:An LLM extracts entities and relationships from each document and writes them into a graph.

> PATHS:

Document Loader: Ingests the corpus and feeds it to the extraction stage document by document.
➔ Entity Extraction: An LLM identifies named entities (people, organizations, concepts) in each document.
➔ Relationship Mining: Extracts subject→predicate→object triples that connect those entities.
➔ Knowledge Graph: Stores entities as nodes and relationships as edges in a Neo4j knowledge graph.

> BLUEPRINT:Translates natural-language questions into Cypher traversals that query the knowledge graph.

[PHASES: 3 | STEPS: 10]
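
To show why triples help with multi-hop questions, here is a small in-memory sketch. It is a stand-in for the Neo4j graph described above, not a Cypher example: the triples, `build_graph`, and `find_path` are all hypothetical, and a breadth-first search plays the role of the traversal.

```python
from collections import defaultdict, deque

Triple = tuple[str, str, str]  # (subject, predicate, object)

def build_graph(triples: list[Triple]) -> dict[str, list[tuple[str, str]]]:
    """Store entities as nodes and relationships as labelled, directed edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph

def find_path(graph, start: str, goal: str, max_hops: int = 3):
    """Breadth-first search for the shortest chain of triples from start to goal."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop budget
        for predicate, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, predicate, neighbor)]))
    return None  # no connection within max_hops

# Made-up triples, as an extraction step might emit them.
triples = [("Ada", "works_at", "Acme"), ("Acme", "based_in", "Berlin")]
graph = build_graph(triples)
print(find_path(graph, "Ada", "Berlin"))
```

The returned chain of triples is the evidence a two-hop question such as "Which city does Ada work in?" needs; neither triple answers it alone.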

[SYS_OK: RAG-AGENTIC]

COMPLEXITY: ████ [EXPERT]

Agentic RAG

A ReAct agent decides when — and how — to retrieve.

> STAGE:Agent Loop

> RECAP:Think → act → observe, repeated until the agent is confident in its evidence.

> PATHS:

ReAct Reasoner: Generates a thought about what information is still missing to answer the question.
➔ Tool Selector: Picks which tool to call next based on the agent's current reasoning.
➔ Observation: Feeds the tool's result back into the loop as fresh evidence for the next thought.
➔ Loop Controller: Caps the number of iterations so the agent always terminates with an answer.

> BLUEPRINT:A dynamic agent loop that acts, observes, and decides when to search or summarize.

[PHASES: 3 | STEPS: 9]
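
The think → act → observe loop and its iteration cap can be sketched without any LLM. Everything here is an assumption made for illustration: `run_agent`, the `decide` callback (which would wrap an LLM call in practice), the `"finish"` action name, and the toy policy and tool.

```python
from typing import Callable

Decision = tuple[str, str]  # (action name, argument); "finish" ends the loop

def run_agent(question: str,
              decide: Callable[[str, list[str]], Decision],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 5) -> str:
    """Repeat think -> act -> observe until the policy finishes or the cap is hit."""
    observations: list[str] = []
    for _ in range(max_steps):
        action, argument = decide(question, observations)  # think, then pick a tool
        if action == "finish":
            return argument                                # the policy is confident
        observations.append(tools[action](argument))       # act, then observe
    # Loop controller: always terminate with some answer.
    return "Step budget reached. Evidence so far: " + "; ".join(observations)

def toy_policy(question: str, observations: list[str]) -> Decision:
    """Hypothetical stand-in for the LLM: search once, then answer."""
    if not observations:
        return ("search", question)
    return ("finish", f"Based on: {observations[-1]}")

def toy_search(query: str) -> str:
    """Hypothetical retrieval tool returning a canned passage."""
    return "Paris is the capital of France."

print(run_agent("What is the capital of France?", toy_policy, {"search": toy_search}))
```

Keeping the cap in the loop itself, rather than trusting the policy to stop, is what guarantees termination even when the reasoning step keeps asking for more evidence.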

[SYS_OK: RAG-VECTORLESS]

COMPLEXITY: ██░░ [MEDIUM]

PageIndex RAG

Zero vector DB. BM25 + pickle persistence.

> STAGE:Indexing

> RECAP:Chunks are tokenized and scored into a classic BM25 index — saved as a single file.

> PATHS:

Document Loader: Reads local files; no cloud, GPU, or vector database required.
➔ Tokenizer: Lowercases and splits text into terms for purely lexical matching.
➔ BM25Okapi Index: Builds a classic BM25 relevance index (a TF-IDF-family ranking function) over the tokenized corpus.
➔ Pickle Persistence: Serializes the entire index to a single index.pkl file for instant reload.

> BLUEPRINT:An offline, local, database-free query engine built on a serialized BM25 index.

[PHASES: 3 | STEPS: 9]
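
The tokenizer, BM25 index, and pickle steps above fit in a dependency-free sketch. This is not the walkthrough's own code: it hand-rolls Okapi BM25 scoring instead of using a library class, uses the non-negative IDF variant `ln((N - df + 0.5) / (df + 0.5) + 1)`, and all function names, the `k1`/`b` defaults, and the sample documents are assumptions.

```python
import math
import pickle
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase and split into word terms for purely lexical matching."""
    return re.findall(r"\w+", text.lower())

def build_index(documents: list[str]) -> dict:
    """Precompute per-document term counts, lengths, and IDF weights."""
    tokenized = [tokenize(doc) for doc in documents]
    n = len(tokenized)
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    return {
        "term_counts": [dict(Counter(doc)) for doc in tokenized],
        "lengths": [len(doc) for doc in tokenized],
        "avg_len": sum(len(doc) for doc in tokenized) / max(n, 1),
        "idf": {t: math.log((n - df + 0.5) / (df + 0.5) + 1.0)
                for t, df in doc_freq.items()},
    }

def search(index: dict, query: str, top_k: int = 3,
           k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return ids of the best-scoring documents, highest BM25 score first."""
    avg_len = index["avg_len"] or 1.0  # guard against an all-empty corpus
    scores = []
    for doc_id, counts in enumerate(index["term_counts"]):
        length_norm = 1 - b + b * index["lengths"][doc_id] / avg_len
        score = 0.0
        for term in tokenize(query):
            tf = counts.get(term, 0)
            if tf:
                score += index["idf"][term] * tf * (k1 + 1) / (tf + k1 * length_norm)
        if score > 0:
            scores.append((score, doc_id))
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scores[:top_k]]

def save_index(index: dict, path: str) -> None:
    """Serialize the whole index to one file, e.g. index.pkl."""
    with open(path, "wb") as handle:
        pickle.dump(index, handle)

def load_index(path: str) -> dict:
    """Reload a saved index. Only unpickle files you created yourself."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Because the index is plain dictionaries and lists, reloading it skips tokenization and IDF computation entirely. Pickle can execute code on load, so the file should be treated as trusted local state rather than something to download.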

[SYS_OK: RAG-WIKI]

COMPLEXITY: ███░ [HIGH]

LLM Wiki

Compounding knowledge — the corpus becomes a wiki.

> STAGE:Ingestion

> RECAP:The LLM clusters the corpus into topics and authors an article for each one.

> PATHS:

Document Loader: Ingests the full corpus so the model can survey it for recurring topics.
➔ Topic Discovery: An LLM clusters the corpus into a set of coherent, distinct topics.
➔ Article Writer: Synthesizes one wiki-style article per topic from the underlying source documents.
➔ Article Store: Embeds each article so queries can be matched at the topic level.

> BLUEPRINT:Clusters documents and compiles them into structured topic articles before retrieval.

[PHASES: 3 | STEPS: 9]
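
The cluster-then-write flow above reduces to grouping documents by topic and calling a writer once per group. In this sketch both LLM steps are replaced by hypothetical stand-ins: `keyword_topic` is a crude keyword rule and `concat_writer` just lists the sources, so only the control flow in `build_wiki` reflects the described pipeline.

```python
from collections import defaultdict
from typing import Callable

def build_wiki(documents: list[str],
               assign_topic: Callable[[str], str],
               write_article: Callable[[str, list[str]], str]) -> dict[str, str]:
    """Group documents by topic, then author one article per topic."""
    by_topic: dict[str, list[str]] = defaultdict(list)
    for doc in documents:
        by_topic[assign_topic(doc)].append(doc)  # topic discovery
    return {topic: write_article(topic, docs)    # article writer
            for topic, docs in by_topic.items()}

def keyword_topic(doc: str) -> str:
    """Hypothetical stand-in for LLM clustering: a single keyword rule."""
    return "billing" if "invoice" in doc.lower() else "general"

def concat_writer(topic: str, docs: list[str]) -> str:
    """Hypothetical stand-in for LLM synthesis: title plus bulleted sources."""
    return f"{topic.title()}\n" + "\n".join(f"- {d}" for d in docs)

wiki = build_wiki(
    ["Invoice 42 is overdue", "Reset your password", "Invoices are sent monthly"],
    keyword_topic,
    concat_writer,
)
print(wiki["billing"])
```

Passing the two steps in as callables keeps the grouping logic testable on its own and lets real model calls be swapped in later.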

[SYS_OK: RAG-MULTIMODAL]

COMPLEXITY: ███░ [HIGH]

Multi-modal RAG

Text + images retrieved together. Vision-grounded answers.

> STAGE:Ingestion

> RECAP:Text and images are extracted, captioned, and embedded into dual stores.

> PATHS:

PDF Parser: Extracts the text of each PDF, page by page, with pypdf.
➔ Image Extractor: Pulls out embedded figures, charts, and photos from the documents.
➔ Vision Captioner: A vision model describes each image so visuals become searchable text.
➔ Dual Vector Stores: Indexes text passages and image captions in two parallel vector stores.

> BLUEPRINT:Processes and embeds text and visual content (charts, figures) into parallel stores.

[PHASES: 3 | STEPS: 10]
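
A rough sketch of how two parallel stores can be queried together. Several things here are assumptions made to keep it self-contained: word-count vectors stand in for a real embedding model, the captions are hand-written rather than produced by a vision model, and merging the two result lists by cosine score is one possible strategy, not necessarily the one the walkthrough uses.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (assumption, not a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class VectorStore:
    """One store per modality: text passages or image captions."""

    def __init__(self, modality: str):
        self.modality = modality
        self.items: list[tuple[str, Counter]] = []

    def add(self, source_id: str, text: str) -> None:
        self.items.append((source_id, embed(text)))

    def search(self, query: str, top_k: int = 2) -> list[tuple[float, str, str]]:
        q = embed(query)
        scored = [(cosine(q, vec), self.modality, sid) for sid, vec in self.items]
        return sorted(scored, reverse=True)[:top_k]

def retrieve(query: str, text_store: VectorStore, image_store: VectorStore,
             top_k: int = 3) -> list[tuple[float, str, str]]:
    """Query both stores and merge the hits into a single ranking."""
    hits = text_store.search(query, top_k) + image_store.search(query, top_k)
    return sorted((h for h in hits if h[0] > 0), reverse=True)[:top_k]

text_store, image_store = VectorStore("text"), VectorStore("image")
text_store.add("page-3", "Revenue grew 12 percent in the third quarter")
text_store.add("page-7", "Hiring plans for the engineering team")
image_store.add("fig-1", "Bar chart of quarterly revenue by region")

for score, modality, source_id in retrieve("revenue chart", text_store, image_store):
    print(f"{modality:5s} {source_id}  {score:.2f}")
```

Each hit keeps its modality tag, so a downstream answer step can tell whether to cite a passage or pass the original image to a vision model.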