RAG — Chat with Your Own Documents

Build a complete RAG chain that loads a persisted FAISS vector store and answers questions grounded strictly in your own documents using LangChain and Ollama.

Jun 4, 2026 at 10:30 AM · 7 min read

Topics You Will Master

Loading a persisted FAISS vector store with FAISS.load_local() and allow_dangerous_deserialization
Creating a retriever from a loaded vector store with as_retriever()
Wiring a complete RAG chain with RunnablePassthrough, a custom prompt, ChatOllama, and StrOutputParser
Understanding how {context} (from the retriever) and {question} (passed through) flow through the chain
Testing the RAG chain on questions drawn from, and related to, the document corpus

Best For

Python developers who have built a FAISS vector store and want to connect it to an LLM for grounded, context-aware question answering.

Expected Outcome

A working RAG chain that retrieves relevant chunks from your own PDFs and answers questions in three sentences or fewer, strictly from the retrieved context.

Retrieval-Augmented Generation (RAG) reduces hallucination by grounding answers in your own documents. Instead of asking an LLM to answer from its training data alone, RAG first retrieves the most relevant passages from your documents, then asks the LLM to answer using only those passages. The LLM becomes a reader and synthesizer, not an oracle.

This lesson picks up directly from the Vector Stores and Retrievals tutorial. The FAISS vector store built there is loaded from disk, wired to a retriever, and connected to a qwen3 LLM via an LCEL RAG chain.

Prerequisites: the health_supplements/ FAISS vector store saved in the previous lesson; langchain-community, langchain-ollama, faiss-cpu, and python-dotenv installed; and Ollama running with the qwen3 and nomic-embed-text models available.

Setup

PYTHON
import os
import warnings
from dotenv import load_dotenv

os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
warnings.filterwarnings("ignore")

load_dotenv()
OUTPUT
True

Loading the Vector Store

Import the embedding model and FAISS vector store classes:

PYTHON
from langchain_ollama import OllamaEmbeddings

import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore

Load the saved vector store from disk. The embedding model must be the same one used to build the index — nomic-embed-text at 768 dimensions:

PYTHON
embeddings = OllamaEmbeddings(model='nomic-embed-text', base_url='http://localhost:11434')

db_name = r"..\09. Vector Stores and Retrievals\health_supplements"
vector_store = FAISS.load_local(db_name, embeddings, allow_dangerous_deserialization=True)
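The "same embedding model" requirement comes down, at minimum, to vector length: a query vector can only be compared with index vectors of the same dimension. The sketch below shows a small guard for that. The helper name check_embedding_dim is hypothetical, and the vectors are stand-in lists so the snippet runs without Ollama; in a live session the two inputs would typically come from embeddings.embed_query(...) and vector_store.index.d.

```python
def check_embedding_dim(query_vector, index_dim):
    """Raise ValueError if the query vector cannot be searched against the index.

    Hypothetical helper for illustration; not part of LangChain or FAISS.
    """
    if len(query_vector) != index_dim:
        raise ValueError(
            f"Embedding has {len(query_vector)} dimensions, "
            f"but the index expects {index_dim}."
        )
    return True


# Stand-in values: nomic-embed-text produces 768-dim vectors (per this lesson).
INDEX_DIM = 768
matching_vector = [0.0] * 768
mismatched_vector = [0.0] * 384  # e.g. a vector from a different, smaller model

print(check_embedding_dim(matching_vector, INDEX_DIM))

try:
    check_embedding_dim(mismatched_vector, INDEX_DIM)
except ValueError as err:
    print(err)
```

A matching dimension is necessary but not sufficient: two different models that happen to share a dimension would still produce vectors that are not comparable, so the model name itself should match the one used at indexing time.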

Important

allow_dangerous_deserialization=True is required because FAISS.load_local() unpickles the docstore. Only load vector stores you created yourself or from trusted sources. If you created the health_supplements/ folder in the previous lesson, this is safe.

On Linux/macOS: adjust db_name to use forward slashes: "../09. Vector Stores and Retrievals/health_supplements".
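One way to avoid maintaining two path strings is to build the path with pathlib, which picks the right separator for the current OS. The sketch below also checks for the two files a saved store normally contains. The helper missing_index_files is hypothetical, and it assumes the store was saved with the default index name (index.faiss and index.pkl); the demonstration runs against a temporary folder rather than the real store.

```python
from pathlib import Path
import tempfile

# Build the path from parts so the separator matches the current OS.
db_path = Path("..") / "09. Vector Stores and Retrievals" / "health_supplements"
print(db_path.name)


def missing_index_files(folder):
    """Return the expected FAISS files that are absent from `folder`.

    Hypothetical helper; assumes the default index name, which
    produces index.faiss and index.pkl.
    """
    expected = ("index.faiss", "index.pkl")
    return [name for name in expected if not (Path(folder) / name).is_file()]


# Demonstration on a temporary folder that holds only one of the two files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "index.faiss").touch()
    demo_missing = missing_index_files(tmp)

print(demo_missing)
```

When loading, str(db_path) can be passed to FAISS.load_local() in place of the hard-coded string.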


Retrieval

Test that the loaded vector store retrieves correct results before building the chain.

PYTHON
question = "how to gain muscle mass?"
docs = vector_store.search(query=question, k=5, search_type="similarity")

Returns the 5 most relevant chunks — passages about protein supplementation, creatine, and strength training goals from the gym supplements research papers.
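Conceptually, a similarity search ranks every stored chunk vector by its distance to the query vector and keeps the k closest. The toy sketch below illustrates that idea with 2-dimensional vectors and Euclidean (L2) distance. It is an assumption that the index uses L2; the actual metric depends on how the index was built in the previous lesson, and FAISS itself uses far faster native code than this loop.

```python
import math


def top_k(query_vec, chunk_vecs, k):
    """Return indices of the k chunk vectors closest to the query (L2 distance)."""
    distances = [
        (math.dist(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)
    ]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Toy 2-dimensional "embeddings"; real nomic-embed-text vectors have 768 values.
chunks = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.3), (0.5, 0.5)]
query = (1.0, 0.0)

nearest = top_k(query, chunks, k=2)
print(nearest)  # [0, 2]
```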

Creating the Retriever

PYTHON
retriever = vector_store.as_retriever(
    search_type='similarity',
    search_kwargs={'k': 3}
)

retriever.invoke(question)

Returns the top 3 relevant Document objects for the query. The retriever is the component that plugs directly into the LCEL RAG chain.

Testing Other Queries

PYTHON
question = "how to lose weight?"
retriever.invoke(question)

Returns chunks specifically about weight loss supplements from the health supplements PDF — passages on chromium, chitosan, Garcinia cambogia, and the limited evidence for supplement-based weight reduction.

MMR Retriever

PYTHON
retriever = vector_store.as_retriever(
    search_type='mmr',
    search_kwargs={'k': 3, 'fetch_k': 20, 'lambda_mult': 1}
)

docs = retriever.invoke(question)

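Maximal Marginal Relevance fetches 20 candidates (fetch_k), then selects 3 of them (k) by trading relevance against diversity. The lambda_mult parameter sets that trade-off: 1 ranks by relevance only, which matches plain similarity search, while 0 maximizes diversity; LangChain's default is 0.5. With lambda_mult set to 1 as above, no diversity is applied, so lower it (for example to 0.5) to keep near-duplicate passages from adjacent pages of the same paper out of the results.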
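To see how lambda_mult changes the selection, here is a minimal pure-Python sketch of greedy MMR on toy 2-dimensional vectors. It is an illustration of the general technique using cosine similarity, not LangChain's own implementation, and the function names are hypothetical. Candidates 0 and 1 are near-duplicates that both match the query well; candidate 2 is less relevant but points in a different direction.

```python
import math


def cosine(a, b):
    """Cosine similarity between two 2-dimensional vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr(query_vec, candidate_vecs, k, lambda_mult):
    """Greedy MMR: return indices of the selected candidates, in pick order."""
    selected = []
    remaining = list(range(len(candidate_vecs)))

    def score(idx):
        relevance = cosine(query_vec, candidate_vecs[idx])
        # Similarity to the closest already-selected item (0 if none yet).
        redundancy = max(
            (cosine(candidate_vecs[idx], candidate_vecs[s]) for s in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = (1.0, 0.0)
candidates = [(1.0, 0.1), (1.0, 0.12), (0.8, -0.6)]

relevance_only = mmr(query, candidates, k=2, lambda_mult=1.0)
balanced = mmr(query, candidates, k=2, lambda_mult=0.5)

print(relevance_only)  # [0, 1]  both near-duplicates are kept
print(balanced)        # [0, 2]  the duplicate is swapped for a different chunk
```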


Building the RAG Chain

LLM

PYTHON
from langchain_ollama import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
PYTHON
llm = ChatOllama(model='qwen3', base_url='http://localhost:11434')
llm.invoke('hi')
OUTPUT
AIMessage(content='Hello! How can I assist you today? 😊', ...)

RAG Prompt

The prompt template is the critical control for RAG behaviour. It instructs the model to answer from the retrieved context, to say that it doesn't know when the answer is absent, and to keep responses concise:

PYTHON
# Raw template text; kept flush-left so no stray indentation reaches the model
prompt_template = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.

Question: {question}
Context: {context}

Answer:"""

# Build the chat prompt from the raw template string
prompt = ChatPromptTemplate.from_template(prompt_template)

Note

In LangChain v0.3+, hub.pull("rlm/rag-prompt") is deprecated. Define the prompt inline as shown above. This gives you full control over the system instructions and eliminates the external dependency on LangChain Hub.
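To preview what the model will actually receive, the RAG prompt above can be filled by hand. The sketch below uses plain str.format as a stand-in for the substitution that ChatPromptTemplate performs on {question} and {context}; the real template additionally wraps the result in a chat message, which this sketch does not model. The context string is a placeholder.

```python
PROMPT_TEMPLATE = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.

Question: {question}
Context: {context}

Answer:"""

# Placeholder context; in the chain this is the output of format_docs.
filled = PROMPT_TEMPLATE.format(
    question="how to lose weight?",
    context="(retrieved chunk text goes here)",
)
print(filled)
```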

Format Documents Helper

PYTHON
def format_docs(docs):
    return '\n\n'.join([doc.page_content for doc in docs])

context = format_docs(docs)

format_docs joins all retrieved chunk texts with double newlines into a single context string passed to the prompt.
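The helper can be tried without a vector store. In the sketch below, FakeDocument is a hypothetical stand-in for a LangChain Document that models only the page_content attribute, which is all format_docs reads.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document; only page_content is modelled."""
    page_content: str


def format_docs(docs):
    # Same helper as in the lesson: join chunk texts with a blank line between.
    return '\n\n'.join([doc.page_content for doc in docs])


sample_docs = [
    FakeDocument("Chunk one text."),
    FakeDocument("Chunk two text."),
]

context = format_docs(sample_docs)
print(repr(context))  # 'Chunk one text.\n\nChunk two text.'
```

The blank line between chunks gives the model a visible boundary between passages that may come from different pages or papers.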

Assembling the Chain

PYTHON
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

The chain has three stages:

  1. Input mapping: a dict with two keys:
     • "context": the retriever sends the query to FAISS, gets the top-k Document objects, then format_docs joins their text into one string
     • "question": RunnablePassthrough() passes the original query string through unchanged
  2. Prompt: ChatPromptTemplate fills {context} and {question} into the RAG prompt
  3. LLM + Parser: ChatOllama generates the answer; StrOutputParser extracts the string content
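The three stages can be traced in plain Python to make the data flow explicit. This is a sketch with stub functions standing in for the retriever and the LLM (all function names here are hypothetical), so it runs without Ollama; it mirrors the shape of the chain, not LCEL's internals.

```python
def fake_retriever(query):
    """Stand-in for retriever.invoke(): returns chunk texts for a query."""
    return ["Chunk A about supplements.", "Chunk B about diet."]


def format_chunks(chunks):
    """Join chunk texts with a blank line, as format_docs does."""
    return "\n\n".join(chunks)


def fill_prompt(inputs):
    """Shortened stand-in for the RAG prompt template."""
    return f"Question: {inputs['question']}\nContext: {inputs['context']}\n\nAnswer:"


def fake_llm(prompt_text):
    """Stand-in for ChatOllama: reports what it received instead of generating."""
    return f"[model saw {len(prompt_text)} characters of prompt]"


def run_chain(query):
    # Stage 1: input mapping. The same query feeds both keys.
    inputs = {
        "context": format_chunks(fake_retriever(query)),
        "question": query,  # the role RunnablePassthrough() plays
    }
    # Stage 2: prompt.
    prompt_text = fill_prompt(inputs)
    # Stage 3: LLM + parser (the stub already returns a plain string).
    return prompt_text, fake_llm(prompt_text)


prompt_text, answer = run_chain("how to lose weight?")
print(prompt_text)
print(answer)
```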

Tip

The dict at the head of the chain is what splits the single input string into two parallel paths; RunnablePassthrough() simply forwards its input unchanged. As a result, the query string is both used by the retriever (to find relevant chunks) and passed directly into the prompt (as {question}).


Running the RAG Chain

PYTHON
question = "how to lose weight?"
response = rag_chain.invoke(question)

print(response)
OUTPUT
To lose weight, focus on lifestyle changes like balanced diet, regular physical activity, and avoiding addictive behaviors. Dietary supplements are not proven effective for weight loss and may pose health risks. Always consult healthcare professionals before using supplements, and avoid unregulated products.

The answer draws on the health supplements PDF, specifically the section on weight loss supplements and their limited efficacy. The prompt steers the LLM to synthesize the retrieved passages rather than answer from general training knowledge alone.

PYTHON
question = "how to gain muscle mass?"
response = rag_chain.invoke(question)

print(response)
OUTPUT
To gain muscle mass, use creatine monohydrate, which supports muscle growth without increasing fat. Combine it with strength training focused on muscle-building goals. Prioritize protein intake and consistent workout routines for optimal results.

The model synthesizes advice from the gym supplements research papers about creatine, protein supplementation, and training frequency — all sourced from the retrieved chunks, not from general knowledge.


How RAG Works: Step-by-Step

PYTHON
User Query: "how to lose weight?"
     │
     ▼
retriever.invoke(query)               ← FAISS similarity search (top-k=3)
     │   Returns 3 Document chunks
     ▼
format_docs(docs)                     ← Join page_content with "\n\n"
     │   Returns one context string
     ▼
ChatPromptTemplate.from_template()    ← Fills {context} + {question}
     │   Returns formatted messages
     ▼
ChatOllama(model='qwen3')             ← Generates answer from context
     │   Returns AIMessage
     ▼
StrOutputParser()                     ← Extracts content string
     │
     ▼
"To lose weight, focus on lifestyle changes..."

The LLM never sees the full 311-chunk corpus, only the 3 most relevant passages retrieved for each query. This keeps the prompt short, the answer focused on relevant material, and local inference fast.
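A quick calculation shows how small that slice is. The chunk counts come from this lesson; the average chunk length is a hypothetical figure used only to give a sense of scale, since the real chunk size depends on the splitter settings from the previous lesson.

```python
TOTAL_CHUNKS = 311   # corpus size stated in this lesson
RETRIEVED_K = 3      # search_kwargs={'k': 3}

fraction = RETRIEVED_K / TOTAL_CHUNKS
print(f"{fraction:.2%} of the corpus reaches the prompt")  # 0.96%

# Hypothetical average chunk length, for scale only.
ASSUMED_CHARS_PER_CHUNK = 1000
context_chars = RETRIEVED_K * ASSUMED_CHARS_PER_CHUNK
corpus_chars = TOTAL_CHUNKS * ASSUMED_CHARS_PER_CHUNK
print(context_chars, "vs", corpus_chars)  # 3000 vs 311000
```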


Quick Reference

Full RAG Setup from a Saved Vector Store

PYTHON
import os
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Load vector store
embeddings = OllamaEmbeddings(model='nomic-embed-text', base_url='http://localhost:11434')
vector_store = FAISS.load_local(
    "path/to/health_supplements",
    embeddings,
    allow_dangerous_deserialization=True
)

# Retriever
retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 3})

# Prompt
prompt_template = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.

Question: {question}
Context: {context}

Answer:"""

prompt = ChatPromptTemplate.from_template(prompt_template)

# LLM
llm = ChatOllama(model='qwen3', base_url='http://localhost:11434')

# Format helper
def format_docs(docs):
    return '\n\n'.join([doc.page_content for doc in docs])

# RAG Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Ask
response = rag_chain.invoke("how to lose weight?")
print(response)

What You Built

In this lesson you completed the full RAG pipeline:

| Component | Implementation |
| --- | --- |
| Vector store | FAISS.load_local(): loads the persisted 311-chunk index |
| Embedding model | OllamaEmbeddings('nomic-embed-text'): 768-dim vectors |
| Retriever | as_retriever(search_type='similarity', search_kwargs={'k': 3}) |
| RAG prompt | ChatPromptTemplate: context-based answering, 3-sentence limit |
| LLM | ChatOllama('qwen3'): local inference |
| Output | StrOutputParser(): clean string answer |
| Chain | {context: retriever … |

Each answer is generated from the retrieved PDF chunks supplied in the prompt rather than from the model's general training data alone. This is the core promise of RAG: document-grounded answers that can be verified against the retrieved passages.
