PageRAG Data Ingestion with Docling & ChromaDB

Build a page-wise PDF ingestion pipeline with Docling — filename metadata, SHA-256 deduplication, and local nomic-embed-text embeddings stored in ChromaDB.

Jun 17, 2026 · 14 min read

Topics You Will Master

Parsing PDFs page by page into Markdown with Docling
Extracting company, document type, fiscal year, and quarter from filenames
Deduplicating documents with a SHA-256 file hash before ingestion
Embedding each page with nomic-embed-text and storing it in ChromaDB with rich metadata

PageRAG is a document-ingestion strategy that treats each page as a unit instead of splitting text into fixed-size chunks. Every page is stored with rich metadata — company, document type, fiscal year, quarter, and page number — which makes financial filings easy to filter ("Amazon 10-K, 2023, page 24") and keeps numerical tables intact within a page.

This is the first lesson in the Agentic RAG with LangGraph series. Here you build the ingestion pipeline; later lessons add retrieval, re-ranking, and five agentic RAG patterns on top of it. The running example is financial analysis over SEC filings for Amazon, Apple, and Google, but the same pattern applies to legal research, medical records, enterprise knowledge bases, and academic paper indexing.

Prerequisites: Comfort with Python and basic vector-store concepts. You need Ollama installed and running, plus the packages below.

BASH
pip install -U docling langchain-chroma langchain-ollama langchain-core python-dotenv
ollama pull nomic-embed-text

Note

On Windows, download and run the installer from the Ollama website; the ollama command is then available in PowerShell.

On Linux/macOS: install with curl -fsSL https://ollama.com/install.sh | sh, then run the same ollama pull command.

Project Layout

PageRAG ingestion pipeline: PDF filings parsed page-wise by Docling, tagged with filename metadata and a file hash, embedded with nomic-embed-text, and stored in ChromaDB

The PDFs live under data/<company>/, ChromaDB persists to ./chroma_financial_db, and the code runs from the project root. All paths are relative, so they work identically on Windows, Linux, and macOS.

PLAINTEXT
project/
├── data/
│   ├── amazon/  amazon 10-k 2023.pdf, amazon 10-q q1 2024.pdf, ...
│   ├── apple/   apple 10-k 2024.pdf, ...
│   └── google/  google 10-k 2023.pdf, ...
└── chroma_financial_db/

Configuration and Vector Store

Load environment variables, then configure the embedding model and the ChromaDB collection. The num_ctx=8192 setting requests an 8,192-token context window for the embedding model, so a full page normally embeds without truncation; a page longer than that would still be cut off.

PYTHON
from dotenv import load_dotenv
load_dotenv()

import hashlib
from pathlib import Path

from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings
from langchain_core.documents import Document

from docling.document_converter import DocumentConverter
PYTHON
DATA_DIR = "data"
CHROMA_DIR = "./chroma_financial_db"
COLLECTION_NAME = "financial_docs"
EMBEDDING_MODEL = 'nomic-embed-text'
BASE_URL = 'http://localhost:11434'
PYTHON
embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL, base_url=BASE_URL, num_ctx=8192)

vector_store = Chroma(
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings,
    persist_directory=CHROMA_DIR
)
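
Whether a page fits inside the num_ctx window depends on its token count, which you only know exactly after tokenizing. As a rough sketch, the helper below flags pages that are likely too long using a characters-per-token heuristic. The function names and the ratio of 4 characters per token are assumptions for illustration, not part of the lesson or of any library; dense tables can tokenize less efficiently than this.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real tokenizers vary, especially on tables."""
    return int(len(text) / chars_per_token) + 1


def pages_over_budget(pages, num_ctx: int = 8192, chars_per_token: float = 4.0):
    """Return the 1-based numbers of pages whose estimate exceeds num_ctx."""
    return [
        page_num
        for page_num, text in enumerate(pages, start=1)
        if estimate_tokens(text, chars_per_token) > num_ctx
    ]


# Made-up pages: one short, one far longer than the window.
sample_pages = ["short page", "x" * 40_000]
print(pages_over_budget(sample_pages))  # [2]
```

Any page this check flags is a candidate for splitting before embedding; pages it does not flag may still be truncated if the heuristic is too optimistic for your content.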

Extracting Metadata from Filenames

Parsing a filename into company, document type, fiscal year, and quarter metadata

The PDFs follow a consistent naming convention — {company} {doc_type} {quarter} {year}.pdf (the quarter is omitted for annual 10-K reports). A small parser turns the filename into a metadata dictionary.

PYTHON
def extract_metadata_from_filename(filename: str) -> dict:
    # Path.stem drops the extension regardless of case ('.pdf' or '.PDF')
    name = Path(filename).stem
    parts = name.split()

    metadata = {}
    if len(parts) == 4:
        # Quarterly report: {company} {doc_type} {quarter} {year}
        metadata['fiscal_quarter'] = parts[2]
        metadata['fiscal_year'] = int(parts[3])
    else:
        # Annual report: {company} {doc_type} {year}
        metadata['fiscal_quarter'] = None
        metadata['fiscal_year'] = int(parts[2])

    metadata['company_name'] = parts[0]
    metadata['doc_type'] = parts[1]

    return metadata
PYTHON
extract_metadata_from_filename('amazon 10-q q1 2024.pdf')
OUTPUT
{'fiscal_quarter': 'q1', 'fiscal_year': 2024, 'company_name': 'amazon', 'doc_type': '10-q'}
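
The parser trusts every filename to follow the convention exactly. As a minimal sketch under that same convention, a regular expression can validate the name first and reject anything unexpected instead of mislabeling fields. The name parse_filing_name and the pattern itself are illustrative assumptions, not part of the original lesson.

```python
import re
from pathlib import Path

# Hypothetical pattern for "{company} {doc_type} [{quarter}] {year}".
FILENAME_PATTERN = re.compile(
    r"^(?P<company>\S+)\s+(?P<doc_type>\S+)"
    r"(?:\s+(?P<quarter>q[1-4]))?\s+(?P<year>\d{4})$",
    re.IGNORECASE,
)


def parse_filing_name(filename: str) -> dict:
    """Return the metadata dict, or raise ValueError for a nonconforming name."""
    stem = Path(filename).stem  # drop the extension
    match = FILENAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename!r}")

    quarter = match.group("quarter")
    return {
        "fiscal_quarter": quarter.lower() if quarter else None,
        "fiscal_year": int(match.group("year")),
        "company_name": match.group("company").lower(),
        "doc_type": match.group("doc_type").lower(),
    }


print(parse_filing_name("apple 10-k 2024.pdf"))
```

Failing loudly on a bad name is a design choice: a page stored with the wrong fiscal year is harder to notice later than an ingestion run that stops with a clear error.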

Page-wise PDF Extraction with Docling

Docling (IBM's open-source document converter) converts a PDF to Markdown and can insert a placeholder at every page break. Splitting on that placeholder yields a list of pages, preserving tables and headings as Markdown.

PYTHON
def extract_pdf_pages(pdf_path):
    converter = DocumentConverter()
    result = converter.convert(pdf_path)

    page_break = "<!-- page break -->"
    markdown_text = result.document.export_to_markdown(page_break_placeholder=page_break)

    pages = markdown_text.split(page_break)
    return pages
PYTHON
pages = extract_pdf_pages('data/amazon/amazon 10-q q1 2024.pdf')
len(pages)
OUTPUT
52

Note

The first conversion downloads Docling's OCR and layout models, so it can take a while. Subsequent runs reuse the cached models, and Docling automatically uses your GPU if one is available.
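
To see what the split does without running a conversion, here is a small sketch on a hand-written Markdown string. It also drops blank pages while keeping the original page numbers, so a reference to "page 3" still points at the right place. The split_pages helper and the sample text are assumptions for illustration, and it assumes the placeholder appears exactly once per page boundary.

```python
PAGE_BREAK = "<!-- page break -->"


def split_pages(markdown_text: str, page_break: str = PAGE_BREAK) -> list[tuple[int, str]]:
    """Split Markdown on the placeholder and return (page_number, text) pairs.

    Blank pages are dropped, but numbering reflects the position in the
    original document.
    """
    pages = []
    for page_num, text in enumerate(markdown_text.split(page_break), start=1):
        cleaned = text.strip()
        if cleaned:
            pages.append((page_num, cleaned))
    return pages


# Three pages: a cover, a blank page, and a page holding a Markdown table.
sample = (
    f"# Cover\n\n{PAGE_BREAK}\n\n{PAGE_BREAK}\n\n"
    "| Item | Value |\n|---|---|\n| Revenue | 10 |"
)

for number, text in split_pages(sample):
    print(number, text.splitlines()[0])
# 1 # Cover
# 3 | Item | Value |
```

The table on page 3 stays in one piece, which is the main reason to split on page boundaries rather than on a fixed character count.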

Deduplication with File Hashing

A SHA-256 file hash deciding whether a document is new or already ingested

To avoid re-ingesting the same document, compute a SHA-256 hash of each file's bytes. Two files with identical content produce the same hash, even if they have different names — so a renamed copy is correctly skipped.

PYTHON
def compute_file_hash(file_path: str) -> str:
    sha256_hash = hashlib.sha256()
    with open(file_path, "rb") as f:
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
    return sha256_hash.hexdigest()
PYTHON
compute_file_hash('data/amazon/amazon 10-q q1 2024.pdf')
OUTPUT
'c08079bc14250c896f3ca151f9a72ecc1ddcb9ca8e5b021539e91af10fae5c4b'
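
The renamed-copy claim is easy to check in isolation. The sketch below writes a placeholder file to a temporary directory, copies it under a different name, and compares hashes; it also shows that changing a single byte produces a different hash. The file names and contents are made up for the demonstration, and compute_file_hash is repeated only so the snippet runs on its own.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def compute_file_hash(file_path) -> str:
    sha256_hash = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Read in small blocks so large PDFs never load fully into memory
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
    return sha256_hash.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "amazon 10-k 2023.pdf"
    original.write_bytes(b"%PDF-1.7 placeholder bytes, not a real filing")

    renamed = Path(tmp) / "amzn-annual-report.pdf"
    shutil.copyfile(original, renamed)  # same bytes, different name

    edited = Path(tmp) / "edited.pdf"
    edited.write_bytes(original.read_bytes() + b" ")  # one extra byte

    same = compute_file_hash(original) == compute_file_hash(renamed)
    different = compute_file_hash(original) != compute_file_hash(edited)

print(same, different)  # True True
```

The flip side is that a re-downloaded filing with even trivially different bytes counts as a new document, so the hash deduplicates exact copies only.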

Before ingesting, read the hashes already stored in ChromaDB so processed files can be skipped on re-runs.

PYTHON
existing_docs = vector_store.get(where={"file_hash": {"$ne": ""}}, include=['metadatas'])
processed_hashes = [m.get('file_hash') for m in existing_docs['metadatas'] if m.get('file_hash')]
processed_hashes = set(processed_hashes)

Ingesting Documents into ChromaDB

Each page combined with its metadata into a Document and stored in ChromaDB

The ingestion function ties everything together: skip already-processed files, extract pages, attach metadata to each page as a Document, and add the documents to the vector store.

PYTHON
def ingest_docs_in_vectordb(pdf_path):
    print(f"Processing: {pdf_path.name}")

    # Skip files whose content hash is already in the store
    file_hash = compute_file_hash(pdf_path)
    if file_hash in processed_hashes:
        print(f"[SKIP] already processed: {pdf_path}")
        return

    pages = extract_pdf_pages(pdf_path)
    file_metadata = extract_metadata_from_filename(pdf_path.name)

    # One Document per page, each carrying the file-level metadata
    processed_pages = []
    for page_num, page_text in enumerate(pages, start=1):
        metadata_dict = file_metadata.copy()
        metadata_dict['page'] = page_num
        metadata_dict['file_hash'] = file_hash
        metadata_dict['source_file'] = pdf_path.name

        doc = Document(page_content=page_text, metadata=metadata_dict)
        processed_pages.append(doc)

    vector_store.add_documents(documents=processed_pages)

    # Record the hash so a duplicate later in the same run is skipped too
    processed_hashes.add(file_hash)
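
For large filings it can help to add pages in batches and to give each page a stable ID derived from the file hash and page number, so the same page always maps to the same record. The sketch below shows only the two helpers, using the standard library. The names make_page_id and batched, the ID format, and the batch size are assumptions for illustration; the commented usage also assumes add_documents accepts an ids argument, as LangChain vector stores generally do.

```python
from itertools import islice
from typing import Iterable, Iterator


def make_page_id(file_hash: str, page: int) -> str:
    """Build a stable ID: same file content and page number, same ID."""
    return f"{file_hash}:{page:04d}"


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


# Hypothetical usage inside the ingestion function (not run here):
# for batch in batched(processed_pages, 100):
#     ids = [make_page_id(d.metadata["file_hash"], d.metadata["page"]) for d in batch]
#     vector_store.add_documents(documents=batch, ids=ids)

page_ids = [make_page_id("c08079bc", n) for n in range(1, 6)]
print(page_ids[0])                             # c08079bc:0001
print([len(b) for b in batched(page_ids, 2)])  # [2, 2, 1]
```

Batching also means a failure partway through a long filing loses at most one batch of embedding work rather than the whole file.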

Use rglob to find every PDF under the data directory and ingest them all.

PYTHON
data_path = Path(DATA_DIR)
pdf_files = list(data_path.rglob("*.pdf"))

for pdf_path in pdf_files:
    ingest_docs_in_vectordb(pdf_path)

Confirm the collection size — each page is one document.

PYTHON
vector_store._collection.count()
OUTPUT
1270

A quick similarity search verifies the store is queryable:

PYTHON
results = vector_store.search("What is Tesla's revenue for Q1 2024", search_type="similarity")

Tip

The store has no Tesla data — only Amazon, Apple, and Google. A plain similarity search still returns its closest matches regardless of relevance. Fixing exactly this problem — filtering by metadata and re-ranking by keyword — is the subject of the next lesson, RAG Data Retrieval and Re-Ranking.
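
As a small preview of what metadata filtering looks like, the sketch below builds a filter dictionary from the metadata fields stored during ingestion. It follows ChromaDB's where syntax, in which several conditions are combined under "$and". The build_where_filter helper is a hypothetical name, and the commented call assumes similarity_search accepts a filter argument; the next lesson covers the actual retrieval approach.

```python
from typing import Optional


def build_where_filter(**conditions) -> Optional[dict]:
    """Turn keyword arguments into a Chroma-style metadata filter.

    Conditions whose value is None are ignored. A single condition is
    returned on its own because "$and" expects at least two clauses.
    """
    clauses = [
        {key: {"$eq": value}}
        for key, value in conditions.items()
        if value is not None
    ]
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


where = build_where_filter(company_name="amazon", doc_type="10-k", fiscal_year=2023)
print(where)

# Hypothetical usage (not run here):
# results = vector_store.similarity_search("net sales by segment", k=5, filter=where)
```

With a filter like this, a question about a company that is not in the store returns nothing instead of the nearest unrelated pages.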


What You Built

In this lesson you built the PageRAG ingestion pipeline:

  • Filename metadata: extract_metadata_from_filename parses company, doc type, fiscal year, and quarter from the file name
  • Page-wise extraction: Docling converts each PDF to Markdown with page-break placeholders, preserving tables and headings
  • Deduplication: a SHA-256 file hash skips documents that are already ingested, even renamed copies
  • Rich-metadata storage: every page becomes a Document tagged with metadata and embedded by nomic-embed-text into ChromaDB
  • A persisted collection: 1,270 page-level documents ready for filtered retrieval

This metadata-rich store is the foundation the rest of the series builds on. Next, you turn it into a precise retriever with metadata filtering and BM25 re-ranking.
