Self-RAG: Grounded Answers with Quality Gates

Build Self-RAG in LangGraph with document grading plus hallucination and answer-quality gates that regenerate or rewrite until the answer is grounded and useful.

Jun 17, 20267 min readFollow

Topics You Will Master

Grading retrieved documents and dropping irrelevant ones
Checking a generated answer for hallucination against the source facts
Checking whether the answer actually resolves the query
Transforming a failed query into focused sub-queries and retrying

Self-RAG (Asai et al., 2023) adds two quality gates after generation. It grades the retrieved documents, generates an answer, then checks the answer for hallucinations (is it grounded in the documents?) and usefulness (does it actually address the query?). Failing either gate sends the flow back to regenerate or to rewrite the query and retrieve again.

This lesson builds on the `retrieve_docs` tool from earlier in the series. The nodes defined here live in scripts/nodes.py and are reused by the final lesson, Adaptive RAG.

Prerequisites: The scripts/my_tools.py tools from RAG Data Retrieval and Re-Ranking. Ollama running with qwen3, plus the packages below.

BASH
pip install -U langgraph langchain-ollama langchain-core pydantic
ollama pull qwen3
95% OFF

Private Agentic RAG with LangGraph and Ollama

Step-by-step guide to building private, self-correcting RAG systems with LangGraph, ChromaDB, and local models like Qwen3 and gpt-oss.

Enroll Now — 95% OFF →

Schemas and State

Four structured-output graders drive the decisions, each returning a simple 'yes'/'no' binary score.

PYTHON
from typing_extensions import TypedDict, Annotated
from typing import List
import os, operator

from langgraph.graph import StateGraph, START, END
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage
from pydantic import BaseModel, Field
from scripts import my_tools

llm = ChatOllama(model="qwen3", base_url="http://localhost:11434", reasoning=True)

class GradeDocuments(BaseModel):
    """Binary score for relevance check on retrieved documents."""
    binary_score: str = Field(description="Documents are relevant to the query, 'yes' or 'no'")

class GradeHallucinations(BaseModel):
    """Binary score for hallucination present in generation answer."""
    binary_score: str = Field(description="Answer is grounded with the facts for the query, 'yes' or 'no'")

class GradeAnswer(BaseModel):
    """Binary score to assess answer addresses query."""
    binary_score: str = Field(description="Answer addresses the query, 'yes' or 'no'")

class SearchQueries(BaseModel):
    """Search queries for retrieving missing information."""
    search_queries: list[str] = Field(description="1-3 search queries to retrieve the missing information.")

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    retrieved_docs: str
    rewritten_queries: List[str]

A helper finds the latest human message, since the message list grows as the graph loops.

PYTHON
def get_latest_user_query(messages: list):
    for message in reversed(messages):
        if isinstance(message, HumanMessage):
            return message.content
    return messages[0].content if messages else ''

Retrieve and Grade Documents

The document-grading gate routing to generation or to query transformation

Retrieval honors rewritten queries when present. The grader filters out erroneous retrievals by emptying retrieved_docs on a "no" verdict, which the router downstream interprets as "transform the query."

PYTHON
def retrieve_node(state):
    query = get_latest_user_query(state['messages'])
    rewritten_queries = state.get('rewritten_queries', [])
    queries_to_search = rewritten_queries if rewritten_queries else [query]

    all_results = []
    for idx, search_query in enumerate(queries_to_search, 1):
        result = my_tools.retrieve_docs.invoke({'query': search_query, 'k': 3})
        all_results.append(f"## Query {idx}: {search_query}\n\n### Retrieved Documents:\n{result}")

    combined_result = "\n\n".join(all_results)
    return {'retrieved_docs': combined_result}

def grade_documents_node(state):
    query = get_latest_user_query(state['messages'])
    documents = state.get('retrieved_docs', 'No document available!')

    llm_structured = llm.with_structured_output(GradeDocuments)
    system_prompt = """You are a grader assessing relevance of retrieved documents to a user query.
                It does not need to be a stringent test. The goal is to filter out erroneous retrievals.
                Give a binary score 'yes' or 'no' to indicate whether the document is relevant to the query."""

    messages = [SystemMessage(system_prompt),
                HumanMessage(f"Retrieved Document: {documents}\n\nUser query: {query}")]
    response = llm_structured.invoke(messages)
    print(f"[GRADE] Relevance:  {response.binary_score}")

    return {'retrieved_docs': documents} if response.binary_score == 'yes' else {'retrieved_docs': ''}

Generate and Transform-Query Nodes

A failed query decomposed into one to three focused sub-queries

The generate node produces the cited Markdown answer. The transform-query node decomposes the original question into focused sub-queries when retrieval fails, avoiding queries it already tried.

PYTHON
def generate_node(state):
    print("[GENERATE] Creating Answer")
    query = get_latest_user_query(state['messages'])
    documents = state.get('retrieved_docs', '')

    system_prompt = """You are a financial document analyst providing detailed, accurate answers.
                Write a comprehensive answer (200-300 words) in MARKDOWN with ## headings,
                **bold**, bullet/numbered lists, and inline citations [1], [2].
                Base your answer ONLY on the provided documents. End with a References list."""

    messages = [SystemMessage(system_prompt),
                HumanMessage(f"Retrieved Document: {documents}\n\nUser query: {query}")]
    response = llm.invoke(messages)
    return {'messages': [response]}

def transform_query_node(state):
    query = get_latest_user_query(state['messages'])
    rewritten_queries = state.get('rewritten_queries', [])
    llm_structured = llm.with_structured_output(SearchQueries)

    system_prompt = """You are a query re-writer that decomposes complex queries into focused search queries optimized for vectorstore retrieval.

                DECOMPOSITION STRATEGY:
                Break down the original query into 1-3 specific, focused queries where each targets:
                - A single company, a specific time period, a specific metric, or a specific document section

                GUIDELINES:
                - Expand abbreviations (e.g., "rev" -> "revenue", "GOOGL" -> "Google")
                - Make each query self-contained and specific (5-10 words each)
                - Avoid repeating previously tried queries"""

    query_context = f"Original Query: {query}"
    if rewritten_queries:
        query_context += "\n\n These queries have been already generated. do not generate same queries again.\n"
        for idx, q in enumerate(rewritten_queries, 1):
            query_context += f"Query {idx}: {q}\n\n"
    query_context += "\n\nGenerate 1-3 focused search queries that decompose the original query."

    response = llm_structured.invoke([SystemMessage(system_prompt), HumanMessage(query_context)])
    print(f"New Search Queries: {response.search_queries}")
    return {"rewritten_queries": response.search_queries}

The Two Quality Gates

Two gates checking the answer is grounded in the facts and resolves the query

should_generate routes to transform_query when no relevant docs survived grading. check_answer_quality runs the hallucination grader first; if the answer is grounded, it checks usefulness — ending on success, rewriting the query if the answer misses the point, or regenerating if it is ungrounded.

PYTHON
def should_generate(state):
    retrieved_docs = state.get('retrieved_docs', '')
    if not retrieved_docs or retrieved_docs.strip() == '':
        print("[ROUTER] No relevant documents - transforming query")
        return 'transform_query'
    print("[ROUTER] Have relevant documents - generating answer")
    return 'generate'

def check_answer_quality(state):
    query = get_latest_user_query(state['messages'])
    documents = state.get('retrieved_docs', '')
    generation = state['messages'][-1].content

    llm_hallucinations = llm.with_structured_output(GradeHallucinations)
    response = llm_hallucinations.invoke([
        SystemMessage("Grade whether the generation is grounded in the set of facts. Score 'yes' or 'no'."),
        HumanMessage(f"Set of facts:\n\n{documents}\n\nLLM Generation: {generation}")
    ])

    if response.binary_score == 'yes':
        print("[ROUTER] Generation is grounded in documents")
        llm_answer = llm.with_structured_output(GradeAnswer)
        answer_response = llm_answer.invoke([
            SystemMessage("Grade whether the answer resolves the query. Score 'yes' or 'no'."),
            HumanMessage(f"User Query: {query}\n\n LLM Generation: {generation}")
        ])
        if answer_response.binary_score == 'yes':
            print('[ROUTER] generation is good. - USEFUL')
            return END
        else:
            print("[ROUTER] Generation does not address the query - NOT USEFUL")
            return "transform_query"
    else:
        print("[ROUTER] Generation NOT grounded in the response")
        return 'generate'

Building the Graph

The Self-RAG happy path: retrieve, grade, generate, then gate for grounding and usefulness

Both conditional gates wire into the graph: one after document grading, one after generation.

PYTHON
def create_self_rag():
    builder = StateGraph(AgentState)

    builder.add_node('retrieve', retrieve_node)
    builder.add_node('grade_documents', grade_documents_node)
    builder.add_node('generate', generate_node)
    builder.add_node('transform_query', transform_query_node)

    builder.add_edge(START, 'retrieve')
    builder.add_edge('retrieve', 'grade_documents')
    builder.add_edge('transform_query', 'retrieve')

    builder.add_conditional_edges('grade_documents', should_generate, ['transform_query', 'generate'])
    builder.add_conditional_edges('generate', check_answer_quality, ['generate', END, 'transform_query'])

    return builder.compile()

agent = create_self_rag()

Testing Self-RAG

A straightforward question retrieves, grades, generates, and passes both gates:

PYTHON
query = "What was Amazon's revenue in 2023?"
result = agent.invoke({'messages': [HumanMessage(query)]})

A comparison question may rewrite the query when the first retrieval is too broad, then regenerate until the answer is grounded and useful:

PYTHON
query = "Compare Apple and Amazon revenue in 2024 q1"
result = agent.invoke({'messages': [HumanMessage(query)]})
result['messages'][-1].pretty_print()

Important

The double gate makes Self-RAG the most robust pattern against hallucination: an answer only reaches the user if it is both grounded in the documents and genuinely answers the question. The trade-off is more LLM calls per query.

The final lesson, Adaptive RAG, reuses these exact nodes as one of three routed paths.


What You Built

In this lesson you built a Self-RAG agent:

  • Document grader — drops irrelevant retrievals before generation
  • Hallucination gate — verifies the answer is grounded in the retrieved facts
  • Usefulness gate — verifies the answer actually resolves the query
  • Query transformer — decomposes failed queries into focused sub-queries and retries
  • Two conditional gatesshould_generate and check_answer_quality regenerate, rewrite, or finish

Self-RAG trades extra LLM calls for answers you can trust — the right choice when factual accuracy matters most.

Found this useful? Keep building with me.

New tutorials every week on YouTube — or go deeper with a full structured course.

Find this tutorial useful?

Subscribe to our YouTube channels for more practical production walk-throughs.

Discussion & Comments