Personal Gym Supplements RAG

Build a domain-specific RAG pipeline for health supplement research papers with custom metadata extraction and hybrid retrieval.

Jun 18, 20268 min readFollow

Topics You Will Master

Adapting RAGWire from finance to health domains with minimal config changes
Writing custom metadata extraction prompts for research papers
Ingesting health supplement PDFs with domain-specific field extraction
Querying across supplements, ingredients, and research findings

RAGWire works beyond finance. This article demonstrates how little changes when you switch from SEC filings to health supplement research papers — a different domain, different data, different metadata schema, same pipeline.

Complete the RAGWire Architecture and Setup article first to understand the core pipeline, and RAGWire Providers and Components for Qdrant Cloud setup.

95% OFF

Advanced RAG – Build & Deploy Production GenAI Apps

Build RAGWire from scratch — multi-agent RAG with LangGraph, CrewAI, AutoGen, FastAPI, and Chainlit.

Enroll Now — 95% OFF →

What Changes Between Domains

Moving from finance to health requires changing only three things:

  1. Data — Health supplement PDFs instead of SEC filings
  2. Metadata confighealth_metadata.yaml instead of finance_metadata.yaml
  3. System prompt — Tuned for health and fitness instead of financial analysis

The config YAML, pipeline code, tools, and agent logic stay identical.

Health Metadata Configuration

Create health_metadata.yaml with fields relevant to research papers:

YAML
prompt: |
  You are an expert research paper analyst specializing in health, fitness, and sports science.
  Your task is to extract structured metadata from the research paper below.

  ## Extraction Rules
  1. **Be thorough**: Extract every field you can find. A field should only be null if the information is completely absent.
  2. **Be precise**: Extract exactly what is stated. Do not infer, assume, or hallucinate values not present in the document.
  3. **Lists**: Scan the entire document and extract ALL matching values  not just the first occurrence.
  4. **Strings**: Normalize to lowercase. Trim extra whitespace.
  5. **Integers**: Return the numeric value only  no units, symbols, or surrounding text.
  6. **Null**: Return null only when the field is genuinely not mentioned anywhere in the document.

fields:
  - name: title
    description: "Full title of the research paper exactly as it appears in the document. Do not paraphrase."

  - name: authors
    description: "List of full author names exactly as they appear in the paper (e.g. 'John A. Smith'). Extract all authors."
    type: list

  - name: publication_year
    description: "Year the paper was published or last revised. Extract the 4-digit year only."
    type: integer

  - name: research_focus
    description: "List of all primary research topics covered, in lowercase-hyphenated format. Not limited to the examples — extract any focus area mentioned in the paper."
    type: list
    values: ["muscle-growth", "recovery", "performance", "endurance", "cognitive-function", "fat-loss", "safety", "hormonal"]

This metadata schema extracts paper-specific fields: title, authors, publication year, and research focus areas. The values list in research_focus provides examples but is not restrictive — the LLM can extract any focus area mentioned in the paper.

Configuration

The config file points to Qdrant Cloud and references the health metadata:

YAML
# config_gemini_qdrant.yaml
embeddings:
  provider: "google"
  model: "models/gemini-embedding-001"
  api_key: "${GOOGLE_API_KEY}"

llm:
  provider: "google"
  model: "gemini-2.5-flash"
  api_key: "${GOOGLE_API_KEY}"

vectorstore:
  url: "${QDRANT_URL}"
  api_key: "${QDRANT_API_KEY}"
  collection_name: "health-rag-google-qdrant"
  use_sparse: true
  force_recreate: false

retriever:
  search_type: "hybrid"
  top_k: 5
  auto_filter: false

metadata:
  config_file: "health_metadata.yaml"

logging:
  level: "INFO"
  console_output: true
  colored: false
  log_file: "./.log/ragwire.log"

The only differences from the finance config are:

  • collection_namehealth-rag-google-qdrant instead of finance-rag-google-qdrant
  • metadata.config_filehealth_metadata.yaml instead of finance_metadata.yaml

Setup and Ingest

PYTHON
from dotenv import load_dotenv
load_dotenv(override=True)

from ragwire import RAGWire, setup_logging
import ragwire

logger = setup_logging(log_level="INFO")
print(ragwire.__version__)
OUTPUT
1.2.7
PYTHON
rag = RAGWire('config_gemini_qdrant.yaml')

Ingest all health supplement research papers:

PYTHON
rag.ingest_directory('../data/health_data')
OUTPUT
{'total': 11, 'processed': 10, 'skipped': 1, 'failed': 0, 'chunks_created': 75, 'errors': []}

The health data directory contains research papers on creatine supplementation, vitamin D and performance, protein muscle synthesis, caffeine and beta-alanine, hydration, sleep, and functional foods for athletes. RAGWire processes all of them in a single call, extracting title, authors, publication year, and research focus from each paper.

Basic Retrieval

Query the collection with simple keyword and semantic searches:

PYTHON
rag.retrieve("protein", top_k=3)

Results return chunks from the creatine and protein supplementation papers, each with full metadata including the extracted research focus areas.

PYTHON
rag.retrieve("what are the benefits of vitamin d?", top_k=3)

Hybrid search combines dense semantic matching with sparse keyword retrieval to surface relevant chunks from the vitamin D performance paper.

Building the Agent

The same two tools from the finance pipeline — get_filter_context and search_documents — work unchanged. The only difference is the system prompt, tuned for a health and fitness assistant:

PYTHON
from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.checkpoint.memory import InMemorySaver

@tool
def get_filter_context(query: str) -> str:
    """Get available metadata fields, stored values, and filter suggestions for a query.

    Call this before search_documents when the query involves a specific company,
    year, or document type. Skip for purely semantic queries.
    """
    return rag.get_filter_context(query)

@tool
def search_documents(query: str, filters=None):
    """Search the document knowledge base for relevant information.

    Args:
        query: The search query
        filters: Optional metadata filters from get_filter_context.
    """
    results = rag.retrieve(query=query, filters=filters)
    if not results:
        return "No relevant information is found!"
    else:
        return results

agent = create_agent(
    model=ChatGoogleGenerativeAI(model="gemini-2.5-flash"),
    tools=[get_filter_context, search_documents],
    system_prompt=(
        "You are a helpful health and fitness research assistant. "
        "For complex questions, break them down into simple sub-questions. "
        "Always use search_documents to retrieve information — never answer from general knowledge. "
        "Use get_filter_context before search_documents when the query involves specific metadata. "
        "Always cite the source paper in your answer."
    ),
    checkpointer=InMemorySaver(),
)

Interactive Q&A

PYTHON
config = {"configurable": {"thread_id": "demo"}}

print("\nRAG Agent ready. Type 'quit' to exit.\n")

while True:
    question = input("You: ").strip()
    if question.lower() in ("quit", "exit", "q"):
        break
    if not question:
        continue
    response = agent.invoke(
        {"messages": [HumanMessage(question)]},
        config=config,
    )
    print(f"\nAgent: {response['messages'][-1].text}\n")

Example queries to try:

  • "What are the benefits of vitamin D for athletes?"
  • "What does the research say about creatine and muscle growth?"
  • "How does caffeine affect endurance performance?"
  • "What hydration strategies are recommended for athletes?"

The agent retrieves from the health supplement research papers and cites source documents, demonstrating that RAGWire's pipeline generalises across domains with minimal configuration changes.

Tip

Compare the quality of retrieval in this health pipeline with the finance pipeline from the previous articles. The same hybrid search strategy, same top_k, same agent architecture — different domain, different metadata, same precision.

Found this useful? Keep building with me.

New tutorials every week on YouTube — or go deeper with a full structured course.

Find this tutorial useful?

Subscribe to our YouTube channels for more practical production walk-throughs.

Discussion & Comments