PageRAG is a document-ingestion strategy that treats each page as a unit instead of splitting text into fixed-size chunks. Every page is stored with rich metadata — company, document type, fiscal year, quarter, and page number — which makes financial filings easy to filter ("Amazon 10-K, 2023, page 24") and keeps numerical tables intact within a page.
This is the first lesson in the Agentic RAG with LangGraph series. Here you build the ingestion pipeline; later lessons add retrieval and re-ranking, then five agentic RAG patterns on top of it. The running example is financial analysis over SEC filings for Amazon, Apple, and Google, but the same pattern applies to legal research, medical records, enterprise knowledge bases, and academic paper indexing.
Prerequisites: Comfort with Python and basic vector-store concepts. You need Ollama installed and running, plus the packages below.
pip install -U docling langchain-chroma langchain-ollama langchain-core python-dotenv
ollama pull nomic-embed-text
Note
On Windows, download and run the installer from the Ollama website; the ollama command is then available in PowerShell.
On Linux/macOS: install with curl -fsSL https://ollama.com/install.sh | sh, then run the same ollama pull command.
Project Layout

The PDFs live under data/<company>/, ChromaDB persists to ./chroma_financial_db, and the code runs from the project root. All paths are relative, so they work identically on Windows, Linux, and macOS.
project/
├── data/
│   ├── amazon/   amazon 10-k 2023.pdf, amazon 10-q q1 2024.pdf, ...
│   ├── apple/    apple 10-k 2024.pdf, ...
│   └── google/   google 10-k 2023.pdf, ...
└── chroma_financial_db/
Configuration and Vector Store
Load environment variables, then configure the embedding model and the ChromaDB collection. The num_ctx=8192 setting raises the embedding model's context window to 8,192 tokens, so a typical page embeds in full; a page longer than that would still be truncated.
import hashlib
from pathlib import Path

from dotenv import load_dotenv
from docling.document_converter import DocumentConverter
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings

# Load variables from a local .env file, if one exists
load_dotenv()

DATA_DIR = "data"
CHROMA_DIR = "./chroma_financial_db"
COLLECTION_NAME = "financial_docs"
EMBEDDING_MODEL = "nomic-embed-text"
BASE_URL = "http://localhost:11434"

# Embeddings are served by the local Ollama instance
embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL, base_url=BASE_URL, num_ctx=8192)

# Persistent ChromaDB collection that stores one document per page
vector_store = Chroma(
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings,
    persist_directory=CHROMA_DIR,
)
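The configuration above calls load_dotenv() but hard-codes every value. If you would rather override them from a .env file, a minimal sketch might look like the following. The variable names OLLAMA_BASE_URL, EMBEDDING_MODEL, and CHROMA_DIR are assumptions made for illustration, not names the lesson or the libraries define.

```python
import os

# Hypothetical variable names; the lesson hard-codes these values instead.
DEFAULTS = {
    "OLLAMA_BASE_URL": "http://localhost:11434",
    "EMBEDDING_MODEL": "nomic-embed-text",
    "CHROMA_DIR": "./chroma_financial_db",
}


def load_settings(env=None) -> dict:
    """Return settings, preferring environment values over the defaults."""
    env = os.environ if env is None else env
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}


# Simulate an environment that overrides only the Ollama URL
settings = load_settings({"OLLAMA_BASE_URL": "http://gpu-box:11434"})
print(settings["OLLAMA_BASE_URL"])  # http://gpu-box:11434
print(settings["EMBEDDING_MODEL"])  # nomic-embed-text
```

Passing a plain dictionary instead of os.environ keeps the function easy to test without touching the real environment.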
Extracting Metadata from Filenames

The PDFs follow a consistent naming convention — {company} {doc_type} {quarter} {year}.pdf (the quarter is omitted for annual 10-K reports). A small parser turns the filename into a metadata dictionary.
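def extract_metadata_from_filename(filename: str) -> dict:
    # Path(...).stem drops only the trailing extension
    parts = Path(filename).stem.split()
    metadata = {}
    if len(parts) == 4:
        # Quarterly report: company, doc type, quarter, year
        metadata['fiscal_quarter'] = parts[2]
        metadata['fiscal_year'] = int(parts[3])
    elif len(parts) == 3:
        # Annual report: company, doc type, year
        metadata['fiscal_quarter'] = None
        metadata['fiscal_year'] = int(parts[2])
    else:
        # Fail loudly instead of storing wrong metadata
        raise ValueError(f"Unexpected filename format: {filename!r}")
    metadata['company_name'] = parts[0]
    metadata['doc_type'] = parts[1]
    return metadata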
extract_metadata_from_filename('amazon 10-q q1 2024.pdf')
{'fiscal_quarter': 'q1', 'fiscal_year': 2024, 'company_name': 'amazon', 'doc_type': '10-q'}
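Splitting on whitespace works only while every filename follows the convention exactly. As an alternative sketch, a regular expression can validate the whole name in one step. The pattern and the function name parse_filing_name are assumptions inferred from the example filenames, not part of the lesson's code.

```python
import re
from pathlib import Path

# Pattern for "{company} {doc_type} [{quarter}] {year}"; an assumption
# based on the example filenames, not a rule stated by the lesson.
FILENAME_RE = re.compile(
    r"^(?P<company>\S+) (?P<doc_type>\S+)(?: (?P<quarter>q[1-4]))? (?P<year>\d{4})$",
    re.IGNORECASE,
)


def parse_filing_name(filename: str) -> dict:
    """Parse a filing filename, raising ValueError if it does not match."""
    match = FILENAME_RE.match(Path(filename).stem)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename!r}")
    return {
        "fiscal_quarter": match["quarter"],  # None for annual reports
        "fiscal_year": int(match["year"]),
        "company_name": match["company"],
        "doc_type": match["doc_type"],
    }


print(parse_filing_name("amazon 10-k 2023.pdf"))
print(parse_filing_name("amazon 10-q q1 2024.pdf"))
```

A name such as notes.pdf raises ValueError here, which surfaces a misnamed file at ingestion time rather than at query time.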
Page-wise PDF Extraction with Docling
Docling (IBM's open-source document converter) converts a PDF to Markdown and can insert a placeholder at every page break. Splitting on that placeholder yields a list of pages, preserving tables and headings as Markdown.
def extract_pdf_pages(pdf_path):
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    # Marker that Docling writes at every page boundary
    page_break = "<!-- page break -->"
    markdown_text = result.document.export_to_markdown(page_break_placeholder=page_break)
    # One list entry per page
    pages = markdown_text.split(page_break)
    return pages
pages = extract_pdf_pages('data/amazon/amazon 10-q q1 2024.pdf')
len(pages)
52
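To see the splitting step in isolation, the sketch below applies the same placeholder to a hand-written Markdown string, so it runs without Docling or a PDF. The sample text and its table values are made up for illustration, and the helper name split_markdown_pages is hypothetical. Unlike the function above, it also strips surrounding whitespace from each page.

```python
PAGE_BREAK = "<!-- page break -->"


def split_markdown_pages(markdown_text: str, page_break: str = PAGE_BREAK) -> list[str]:
    """Split exported Markdown into pages, trimming whitespace around each."""
    return [page.strip() for page in markdown_text.split(page_break)]


# Made-up three-page document; the table sits entirely on page 2
sample = (
    "# Cover page\n\n"
    f"{PAGE_BREAK}\n\n"
    "| Item | Value |\n|---|---|\n| Example row | 100 |\n\n"
    f"{PAGE_BREAK}\n\n"
    "Notes to the statements."
)

pages = split_markdown_pages(sample)
for page_num, page in enumerate(pages, start=1):
    print(page_num, repr(page[:30]))
```

Because the split happens only at page boundaries, the table on page 2 stays in one piece, which is the property that fixed-size chunking cannot guarantee.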
Note
The first conversion downloads Docling's OCR and layout models, so it can take a while. Subsequent runs reuse the cached models, and Docling automatically uses your GPU if one is available.
Deduplication with File Hashing

To avoid re-ingesting the same document, compute a SHA-256 hash of each file's bytes. Two files with identical content produce the same hash, even if they have different names — so a renamed copy is correctly skipped.
def compute_file_hash(file_path: str) -> str:
    sha256_hash = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Read in 4 KB blocks so large PDFs are never loaded into memory at once
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
    return sha256_hash.hexdigest()
compute_file_hash('data/amazon/amazon 10-q q1 2024.pdf')
'c08079bc14250c896f3ca151f9a72ecc1ddcb9ca8e5b021539e91af10fae5c4b'
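The renamed-copy claim is easy to check on throwaway files. The sketch below repeats the same block-wise hashing under a different function name, writes placeholder bytes (not a real PDF) into a temporary directory, and compares the hashes of an original, a renamed copy, and an edited copy.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of_file(file_path) -> str:
    """Hash a file's bytes in 4 KB blocks, as in compute_file_hash."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        for block in iter(lambda: f.read(4096), b""):
            digest.update(block)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "report.pdf"
    renamed = Path(tmp) / "report copy.pdf"
    edited = Path(tmp) / "report edited.pdf"

    original.write_bytes(b"%PDF-1.7 placeholder bytes")
    shutil.copyfile(original, renamed)                   # same bytes, new name
    edited.write_bytes(b"%PDF-1.7 placeholder bytes!")   # one extra byte

    same_hash = sha256_of_file(original) == sha256_of_file(renamed)
    changed_hash = sha256_of_file(original) != sha256_of_file(edited)

print(same_hash, changed_hash)  # True True
```

The hash depends only on content, so a re-exported or re-saved PDF with even one changed byte counts as a new document.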
Before ingesting, read the hashes already stored in ChromaDB so processed files can be skipped on re-runs.
existing_docs = vector_store.get(where={"file_hash": {"$ne": ""}}, include=['metadatas'])
processed_hashes = [m.get('file_hash') for m in existing_docs['metadatas'] if m.get('file_hash')]
processed_hashes = set(processed_hashes)
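The lookup returns one metadata dictionary per stored page, so the same hash appears once for every page of a file. The sketch below runs the same collection step on a hand-written dictionary shaped like that result; the sample values and the helper name collect_processed_hashes are illustrative assumptions.

```python
def collect_processed_hashes(get_result: dict) -> set:
    """Collect the distinct, non-empty file hashes from a get() style result."""
    metadatas = get_result.get("metadatas") or []
    return {m["file_hash"] for m in metadatas if m and m.get("file_hash")}


# Hand-written stand-in for the store's response: three pages from two files
sample_result = {
    "ids": ["a", "b", "c"],
    "metadatas": [
        {"file_hash": "abc123", "page": 1},
        {"file_hash": "abc123", "page": 2},
        {"file_hash": "def456", "page": 1},
    ],
}

hashes = collect_processed_hashes(sample_result)
print(sorted(hashes))  # ['abc123', 'def456']
```

Using a set makes the later membership test a constant-time lookup, however many pages the store holds.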
Ingesting Documents into ChromaDB

The ingestion function ties everything together: skip already-processed files, extract pages, attach metadata to each page as a Document, and add the documents to the vector store.
def ingest_docs_in_vectordb(pdf_path):
    print(f"Processing: {pdf_path.name}")
    file_hash = compute_file_hash(pdf_path)
    # Skip files whose content is already in the store
    if file_hash in processed_hashes:
        print(f"[SKIP] already processed: {pdf_path}")
        return
    pages = extract_pdf_pages(pdf_path)
    file_metadata = extract_metadata_from_filename(pdf_path.name)
    processed_pages = []
    for page_num, page_text in enumerate(pages, start=1):
        # File-level metadata plus page-specific fields
        metadata_dict = file_metadata.copy()
        metadata_dict['page'] = page_num
        metadata_dict['file_hash'] = file_hash
        metadata_dict['source_file'] = pdf_path.name
        doc = Document(page_content=page_text, metadata=metadata_dict)
        processed_pages.append(doc)
    vector_store.add_documents(documents=processed_pages)
    # Record the hash so a duplicate later in the same run is skipped too
    processed_hashes.add(file_hash)
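To inspect what each stored page carries, the sketch below builds the same per-page metadata as plain dictionaries, with no vector store involved. The helper name build_page_records and the id field (file hash plus page number) are assumptions added for illustration; the lesson's function does not assign explicit IDs.

```python
def build_page_records(pages, file_metadata, file_hash, source_file):
    """Pair each page's text with file-level and page-level metadata."""
    records = []
    for page_num, page_text in enumerate(pages, start=1):
        metadata = {
            **file_metadata,
            "page": page_num,
            "file_hash": file_hash,
            "source_file": source_file,
        }
        records.append({
            # Hypothetical deterministic ID; not used by the lesson's code
            "id": f"{file_hash}:{page_num}",
            "text": page_text,
            "metadata": metadata,
        })
    return records


records = build_page_records(
    pages=["page one text", "page two text"],
    file_metadata={"company_name": "amazon", "doc_type": "10-q",
                   "fiscal_quarter": "q1", "fiscal_year": 2024},
    file_hash="abc123",
    source_file="amazon 10-q q1 2024.pdf",
)
print(records[1]["id"])                # abc123:2
print(records[1]["metadata"]["page"])  # 2
```

Every page repeats the file-level fields, which is what later lets a query narrow results to one company, document type, year, or page.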
Use rglob to find every PDF under the data directory and ingest them all.
data_path = Path(DATA_DIR)
# rglob searches every company subfolder recursively
pdf_files = list(data_path.rglob("*.pdf"))
for pdf_path in pdf_files:
    ingest_docs_in_vectordb(pdf_path)
Confirm the collection size — each page is one document.
vector_store._collection.count()
1270
A quick similarity search verifies the store is queryable:
results = vector_store.search("What is Tesla's revenue for Q1 2024", search_type="similarity")
Tip
The store has no Tesla data — only Amazon, Apple, and Google. A plain similarity search still returns its closest matches regardless of relevance. Fixing exactly this problem — filtering by metadata and re-ranking by keyword — is the subject of the next lesson, RAG Data Retrieval and Re-Ranking.
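As a small preview of why the stored metadata matters, the sketch below builds a filter dictionary using Chroma's $eq and $and operators. The helper name build_where_filter is hypothetical, and this is only one possible shape for such a filter; the retrieval approach the series actually uses is covered in the next lesson.

```python
def build_where_filter(**conditions):
    """Build a Chroma-style metadata filter from keyword conditions.

    Conditions whose value is None are ignored. Returns None when no
    condition remains, a single clause for one condition, and an
    "$and" of clauses for several.
    """
    clauses = [
        {key: {"$eq": value}}
        for key, value in conditions.items()
        if value is not None
    ]
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


where = build_where_filter(company_name="amazon", doc_type="10-k", fiscal_year=2023)
print(where)
```

Assuming the vector_store defined earlier, a dictionary like this could be passed as the filter argument of vector_store.similarity_search, so a question about a company that is not in the store returns nothing instead of unrelated pages.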
What You Built
In this lesson you built the PageRAG ingestion pipeline:
- Filename metadata: extract_metadata_from_filename parses company, doc type, fiscal year, and quarter from the file name
- Page-wise extraction: Docling converts each PDF to Markdown with page-break placeholders, preserving tables and headings
- Deduplication: a SHA-256 file hash skips documents that are already ingested, even renamed copies
- Rich-metadata storage: every page becomes a Document tagged with metadata and embedded by nomic-embed-text into ChromaDB
- A persisted collection: 1,270 page-level documents ready for filtered retrieval
This metadata-rich store is the foundation the rest of the series builds on. Next, you turn it into a precise retriever with metadata filtering and BM25 re-ranking.