
LangChain Document Loaders

Load PDFs, webpages, PowerPoint, Excel, and Word files into LangChain for Q&A, summarization, and report generation, plus MarkItDown and Docling for advanced extraction.

Jun 4, 2026 at 10:30 AM | 19 min read

Topics You Will Master

Loading single and batches of PDFs with PyMuPDFLoader for Q&A and summarization
Counting tokens with tiktoken to understand LLM context window usage
Building context-aware Q&A and summarization chains with ChatPromptTemplate
Scraping multiple webpages asynchronously with WebBaseLoader and alazy_load()
Cleaning raw web text with regex and chunking long documents for LLM processing
Loading PowerPoint slides with UnstructuredPowerPointLoader and generating speaker scripts
Reading Excel files with UnstructuredExcelLoader and querying table data with an LLM
Loading Word documents with Docx2txtLoader for personalized job application generation
Converting any file (PDF, DOCX, PPTX, XLSX, images, YouTube URLs) to Markdown with Microsoft's MarkItDown
Converting PDFs to structured Markdown, HTML, and CSV tables using IBM's Docling, including figure extraction

Best For

Python developers building RAG pipelines or document-processing applications who need to ingest and query content from diverse file formats.

Expected Outcome

A complete document loading toolkit: five loader patterns covering PDF, web, PowerPoint, Excel, and Word sources, wired to an Ollama LLM chain for Q&A, summarization, and structured report generation.

Document loaders are LangChain's bridge between raw files and LLM chains. They read a source — a local PDF, a live webpage, a PowerPoint deck, an Excel sheet, a Word document — and return a list of Document objects: each with a page_content string and a metadata dict. From there, the content can be passed directly to any LCEL chain for Q&A, summarization, or report generation.

This lesson covers five loader patterns across five notebooks, plus two next-generation converters, MarkItDown (Microsoft) and Docling (IBM), that go beyond basic loading to produce structured Markdown and extracted tables from complex documents.

Prerequisites: LangChain, langchain-community, langchain-ollama, pymupdf, tiktoken, python-dotenv, unstructured, openpyxl, python-pptx, docx2txt, markitdown, and docling installed. Ollama running locally with qwen3.

Note

Install the full unstructured package for Office file support: pip install "unstructured[all-docs]". Install MarkItDown with pip install markitdown and Docling with pip install docling.

PDF Document Loaders

Loading a Single PDF

PyMuPDFLoader reads a PDF file and returns one Document per page. The metadata dict includes source path, page number, total pages, format, author, creation date, and more.

PYTHON
from dotenv import load_dotenv

load_dotenv('.env')
OUTPUT
True

The True output means at least one variable was loaded. If .env sits in the parent directory instead, pass the relative path: load_dotenv('./../.env').

PYTHON
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("rag-dataset/health supplements/1. dietary supplements - for whom.pdf")

docs = loader.load()
PYTHON
len(docs)
OUTPUT
17

Each element in docs is one page. Access content and metadata via docs[0].page_content and docs[0].metadata.

Loading All PDFs in a Directory

Walk the rag-dataset/ folder and load every .pdf file into a single flat list:

PYTHON
import os

pdfs = []
for root, dirs, files in os.walk("rag-dataset"):
    for file in files:
        if file.endswith(".pdf"):
            pdfs.append(os.path.join(root, file))

docs = []
for pdf in pdfs:
    loader = PyMuPDFLoader(pdf)
    temp = loader.load()
    docs.extend(temp)

len(docs)
OUTPUT
64

64 pages across all PDFs in the dataset. Build a single context string by joining all page contents:

PYTHON
def format_docs(docs):
    return "\n\n".join([x.page_content for x in docs])

context = format_docs(docs)
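The directory walk earlier in this section can also be written with pathlib. The sketch below is one possible variant, not part of the original notebook: find_pdfs is a hypothetical helper name, and it differs from the os.walk version in two ways: it sorts the paths so pages load in a stable order, and it matches the extension case-insensitively.

```python
from pathlib import Path


def find_pdfs(root):
    """Return every PDF under root (recursively), sorted for a stable load order."""
    root = Path(root)
    # rglob("*") walks all subfolders; comparing the lowercased suffix also
    # catches files named ".PDF", which str.endswith(".pdf") would skip.
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Usage with the loader from this section (not run here):
# docs = []
# for pdf in find_pdfs("rag-dataset"):
#     docs.extend(PyMuPDFLoader(str(pdf)).load())
```

A stable order matters when you compare token counts or outputs between runs, because os.walk does not guarantee the order in which files are listed.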

Understanding Token Count with tiktoken

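Before sending context to an LLM, check how many tokens it uses. Every model has a context window limit, and prompt length affects latency, answer quality, and (on hosted APIs) cost. tiktoken ships OpenAI tokenizers, so the counts below use gpt-4o-mini's encoding and are only an estimate for qwen3, which has its own tokenizer.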

PYTHON
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4o-mini")

Verify encoding works on sample strings:

PYTHON
encoding.encode("congratulations"), encoding.encode("rqsqeft")
OUTPUT
([542, 111291, 14571], [81, 31847, 80, 5276])

Count tokens for one page and the full context:

PYTHON
len(encoding.encode(docs[0].page_content))
OUTPUT
968
PYTHON
len(encoding.encode(context))
OUTPUT
58181

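As a rough sanity check, multiply the first page's count by the 64 pages. The estimate lands in the same range as the measured total:

PYTHON
968 * 64
OUTPUT
61952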

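The full dataset is ~58,000 tokens. That fits a 128K-token context window for direct Q&A, but smaller windows need chunking. Note that Ollama loads models with its own default context length, which is typically much smaller than the model's maximum; unless you raise it (for example with ChatOllama's num_ctx parameter), a prompt this long can be truncated without raising an error.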
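The comparison between a token count and a context window can be made explicit in code. This is a minimal sketch: fits_in_context is a hypothetical helper, the 2,000-token output reserve is an assumed value, and the window sizes in the loop are illustrative rather than properties of any particular model.

```python
def fits_in_context(prompt_tokens, context_window, reserved_for_output=2_000):
    """True if the prompt plus an output reserve fits inside the window."""
    return prompt_tokens + reserved_for_output <= context_window


context_tokens = 58_181  # measured above with tiktoken

for window in (8_192, 32_768, 128_000):  # illustrative window sizes
    verdict = "fits" if fits_in_context(context_tokens, window) else "needs chunking"
    print(f"{window:>7} tokens: {verdict}")
```

Reserving room for the answer matters because the window is shared between the prompt and the generated tokens.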

Project 1: Q&A from PDF

Build a Q&A chain that answers strictly from the loaded PDF context. The human message tells the model to answer from the provided context only, and the system message caps the answer length through the {words} variable:

PYTHON
from langchain_ollama import ChatOllama
from langchain_core.prompts import (SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate)
from langchain_core.output_parsers import StrOutputParser

base_url = "http://localhost:11434"
model = 'qwen3'
llm = ChatOllama(base_url=base_url, model=model)

system = SystemMessagePromptTemplate.from_template("""You are helpful AI assistant who answer user question based on the provided context.
                                                    Do not answer in more than {words} words""")

prompt = """Answer user question based on the provided context ONLY! If you do not know the answer, just say "I don't know".
            ### Context:
            {context}

            ### Question:
            {question}

            ### Answer:"""

prompt = HumanMessagePromptTemplate.from_template(prompt)

messages = [system, prompt]
template = ChatPromptTemplate(messages)

qna_chain = template | llm | StrOutputParser()

Confirm the chain structure:

PYTHON
qna_chain
OUTPUT
ChatPromptTemplate(input_variables=['context', 'question', 'words'], ...)
| ChatOllama(model='qwen3', base_url='http://localhost:11434')
| StrOutputParser()

Ask a question grounded in the document:

PYTHON
response = qna_chain.invoke({'context': context, 'question': "How to gain muscle mass?", 'words': 50})
print(response)
OUTPUT
To gain muscle mass (hypertrophy), a combination of resistance training, proper nutrition, recovery, and lifestyle factors is essential. Here's a structured approach:

### 1. Resistance Training (Exercise)
- Focus on Compound Movements: Prioritize exercises that work multiple muscle groups...
- Progressive Overload: Gradually increase weight, reps, or intensity over time...
- Training Frequency: Train each major muscle group 2–3 times per week...
- Rep Range: Aim for 6–12 reps per set for hypertrophy...

### 2. Nutrition
- Protein Intake: Consume 1.2–2.2 grams of protein per kilogram of body weight daily...
- Caloric Surplus: Eat more calories than you burn — aim for a surplus of 250–500 calories/day...

Test the "I don't know" boundary with an out-of-scope question:

PYTHON
response = qna_chain.invoke({'context': context, 'question': "How many planets are there outside of our solar system?", 'words': 50})
print(response)
OUTPUT
As of the latest data up to 2023, there are over 5,000 confirmed exoplanets...

Note

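The model answered from its general knowledge instead of saying "I don't know". Prompt instructions alone do not enforce grounding, and they are easy to lose in a ~58,000-token prompt; the earlier muscle-mass answer also ignored the 50-word cap. For strict RAG, use a retriever-based setup (covered in later lessons) so only relevant chunks are passed as context.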
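To illustrate the idea of passing only relevant chunks, here is a toy sketch that ranks chunks by word overlap with the question and refuses in code when nothing matches. It is a stand-in for a real embedding-based retriever, not the method used in later lessons. The names tokenize, top_chunks, answer, and the small STOPWORDS set are all hypothetical, and ask is assumed to be any callable that takes (context, question) and returns a string, such as a thin wrapper around qna_chain.invoke.

```python
import re

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "are", "how", "many",
             "there", "in", "our", "what"}


def tokenize(text):
    """Lowercase word set; a crude stand-in for real embeddings."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunks(question, chunks, k=3):
    """Rank chunks by how many non-stopword question words they share."""
    q = tokenize(question) - STOPWORDS
    scored = [(len(q & tokenize(c)), c) for c in chunks]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def answer(question, chunks, ask):
    """Call ask(context, question) only when some chunk looks relevant."""
    relevant = top_chunks(question, chunks)
    if not relevant:
        # The refusal is decided in code, so the model cannot improvise.
        return "I don't know"
    return ask("\n\n".join(relevant), question)
```

Keyword overlap misses synonyms and paraphrases, which is why production retrievers use embeddings, but the control flow (retrieve, check, then call the model) is the same.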

Project 2: PDF Summarization

Swap the prompt for a summarization task. The {words} variable lets you control the length of the summary at call time:

PYTHON
system = SystemMessagePromptTemplate.from_template("""You are helpful AI assistant who works as document summarizer.
                                                   You must not hallucinate or provide any false information.""")

prompt = """Summarize the given context in at most {words} words.
            ### Context:
            {context}

            ### Summary:"""

prompt = HumanMessagePromptTemplate.from_template(prompt)
template = ChatPromptTemplate([system, prompt])

summary_chain = template | llm | StrOutputParser()

Short summary (50-word limit):

PYTHON
response = summary_chain.invoke({'context': context, 'words': 50})
print(response)
OUTPUT
Summary of Dietary Supplements and Nutraceuticals: Safety, Regulation, and Adverse Effects

Regulatory Frameworks:
- United States: DSHEA allows supplements to be sold without pre-market approval. FDA focuses on post-market safety monitoring.
- European Union: Stricter regulations under EC Regulation 1924/2006. EFSA evaluates health claims.
...
Key Adverse Effects: Excess vitamin A/E, omega-3 drug interactions, soy isoflavone estrogenic effects, banned stimulants (Ephedra, DMAA), herbal liver toxicity (Black Cohosh, Kava).

Longer summary (500 words):

PYTHON
response = summary_chain.invoke({'context': context, 'words': 500})
print(response)
OUTPUT
Summary of Dietary Supplements and Nutraceuticals: Benefits, Risks, and Regulatory Context

Regulatory Framework: In the U.S., dietary supplements are regulated under DSHEA of 1994. Unlike drugs, supplements do not require pre-market approval...

Key Supplement Categories and Risks:
1. Vitamins and Minerals: Essential for preventing deficiencies. Excess intake (vitamin A, E) can cause toxicity...
2. Omega-3 Fatty Acids: Linked to cardiovascular health. High doses may increase bleeding risk with anticoagulants...
3. Protein Powders: Support muscle growth. Soy isoflavones may mimic estrogen, raising reproductive concerns...
4. Weight-Loss Supplements: Ephedra and DMAA have been banned due to severe side effects (hypertension, heart attacks, liver damage)...
5. Botanical Supplements: Black cohosh, kava, ginkgo biloba carry liver toxicity and bleeding interaction risks...

Project 3: Report Generation

Generate a full structured Markdown report from the document context:

PYTHON
response = qna_chain.invoke({'context': context,
                             'question': "Provide a detailed report from the provided context. Write answer in Markdown.",
                             'words': 2000})
print(response)

The output is a full Markdown report with sections on regulatory frameworks, supplement categories, adverse effects, notable bans, and a conclusion — derived entirely from the loaded PDFs. You can save it directly to a file:

PYTHON
import os
os.makedirs("data", exist_ok=True)  # open() fails if the folder is missing
with open("data/report.md", "w", encoding="utf-8") as f:
    f.write(response)

Webpage Loaders

Project 1: Share Market Data Analysis

WebBaseLoader downloads each URL and returns the extracted text as Document objects. Its alazy_load() method fetches pages asynchronously, which is usually faster than the synchronous load() when there are several URLs.

PYTHON
from langchain_community.document_loaders import WebBaseLoader

urls = [
    'https://economictimes.indiatimes.com/markets/stocks/news',
    'https://www.livemint.com/latest-news',
    'https://www.livemint.com/latest-news/page-2',
    'https://www.livemint.com/latest-news/page-3',
    'https://www.moneycontrol.com/'
]

Note

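Set the USER_AGENT environment variable to identify your requests to web servers: os.environ["USER_AGENT"] = "MyApp/1.0". Set it before importing WebBaseLoader, because langchain-community reads the variable at import time and warns if it is missing. Without it, some sites may block or throttle your requests.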

PYTHON
loader = WebBaseLoader(web_paths=urls)

docs = []
async for doc in loader.alazy_load():
    docs.append(doc)
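A top-level async for works in a Jupyter notebook, which already runs an event loop, but not in a plain Python script. The sketch below shows one way to drain an async iterator from a script with asyncio.run. The collect helper is hypothetical, and fake_alazy_load is a stand-in generator so the example runs without network access; in real code you would pass loader.alazy_load() instead.

```python
import asyncio


async def collect(async_iterable):
    """Drain an async iterator into a list."""
    return [item async for item in async_iterable]


async def fake_alazy_load():
    """Stand-in for loader.alazy_load() so the sketch runs offline."""
    for text in ("page one", "page two"):
        await asyncio.sleep(0)  # yield control, as a real fetch would
        yield text


# In a real script: docs = asyncio.run(collect(loader.alazy_load()))
docs = asyncio.run(collect(fake_alazy_load()))
print(len(docs))
```

asyncio.run creates and closes its own event loop, so it should only be called from synchronous code, not from inside a notebook cell.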

Combine all page content into one context string:

PYTHON
def format_docs(docs):
    return "\n\n".join([x.page_content for x in docs])

context = format_docs(docs)

Cleaning Raw Web Text

Raw web scrapes contain repeated newlines, tabs, and excess whitespace. A regex cleaner normalizes the text before passing it to the LLM:

PYTHON
import re

def text_clean(text):
    # \s matches spaces, tabs, and newlines alike, so this single substitution
    # flattens the text into one line with single spaces between words.
    text = re.sub(r'\s+', ' ', text)
    return text

context = text_clean(context)

The cleaned context is a continuous string of all scraped market news, headlines, and article snippets.
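Flattening everything to one line discards paragraph breaks. If you would rather keep them, for example to split on blank lines later, a variant like the following may help. It is a sketch under that assumption, and text_clean_keep_paragraphs is a hypothetical name, not part of the original notebook.

```python
import re


def text_clean_keep_paragraphs(text):
    """Collapse spaces and tabs but keep one blank line between paragraphs."""
    text = text.replace("\r\n", "\n")        # normalize Windows line endings
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # 3+ newlines -> one blank line
    return text.strip()


print(repr(text_clean_keep_paragraphs("Title\t\t here \n\n\n\nBody   text\n")))
# → 'Title here\n\nBody text'
```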

Chunking Long Contexts

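When the scraped context is too long for a single LLM call, split it into fixed-size character windows. The overlap repeats the last 100 characters of each chunk at the start of the next, so text cut at a boundary shows up again in the following chunk. It does not align chunks to sentence boundaries.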

PYTHON
def chunk_text(text, chunk_size, overlap=100):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks

chunks = chunk_text(context, 10_000)
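Two edge cases in the sliding window are worth seeing on a small input: an overlap that is not smaller than chunk_size makes the step zero or negative, and the final window can be a short tail that is already fully contained in the previous chunk. The following sketch, with the hypothetical name chunk_text_checked, validates the arguments and stops once a chunk reaches the end of the text.

```python
def chunk_text_checked(text, chunk_size, overlap=100):
    """Sliding-window chunks of chunk_size characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end; skip a redundant tail
    return chunks


print(chunk_text_checked("abcdefghij", chunk_size=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

With the same input, the original chunk_text would also emit a fourth chunk, 'j', which repeats the last character of 'ghij' and would cost one extra LLM call.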

Extracting Market News from Each Chunk

The shared llm.py helper script wraps the LangChain Q&A chain as a single ask_llm(context, question) call, imported as a module:

PYTHON
from scripts import llm
PYTHON
question = "Extract stock market news from the given text."

chunk_summary = []
for chunk in chunks:
    response = llm.ask_llm(chunk, question)
    chunk_summary.append(response)

Each chunk returns a structured market news summary. First chunk output:

PYTHON
for chunk in chunk_summary:
    print(chunk)
    break
OUTPUT
Here is the extracted stock market news from the provided text:

1. Nifty Index Performance: The Nifty25 index closed at 25,709.85, with a 124.55-point gain. It broke above a key chart pattern, signaling a bullish trend...
2. Featured Funds: HSBC Large Cap Fund Direct-Growth (5Y Return: 18.34%), UTI Aggressive Hybrid Fund Regular Plan-Growth (5Y Return: 19.89%)
3. Diwali 2025 Investment Picks: Small-cap stocks are highlighted for Samvat 2082 with up to 36% upside...
4. Market Trends: Rising VIX signals hedging. 9 Nifty500 stocks gained for 5 consecutive days...
5. Company Updates: Signature Global raised Rs 875 crore via debentures to reduce debt and expand...
6. Earnings Season: Banks' Q2 earnings under scrutiny. Mutual funds cut holdings in 10 stocks (down up to 70%)...

Generating the Final Market Report

Combine all chunk summaries and generate a single polished Markdown report:

PYTHON
summary = "\n\n".join(chunk_summary)

question = "Write a detailed market news report in markdown format. Think carefully then write the report."
response = llm.ask_llm(summary, question)
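The per-chunk extraction followed by a final pass over the joined summaries is a map-reduce pattern. It can be wrapped in one function, sketched below. map_reduce is a hypothetical helper, ask is assumed to be any callable with the (context, question) signature of llm.ask_llm, and the "## Chunk N" headings are an illustrative choice that is not in the original notebook.

```python
def map_reduce(chunks, ask, map_question, reduce_question):
    """Summarize each chunk, then summarize the joined summaries."""
    partials = []
    for index, chunk in enumerate(chunks, start=1):
        answer = ask(chunk, map_question)
        if answer.strip():  # drop empty responses
            partials.append(f"## Chunk {index}\n\n{answer.strip()}")
    combined = "\n\n".join(partials)
    return combined, ask(combined, reduce_question)


# Usage with the helper module from this lesson (not run here):
# summary, report = map_reduce(
#     chunks, llm.ask_llm,
#     "Extract stock market news from the given text.",
#     "Write a detailed market news report in markdown format.",
# )
```

Passing ask as a parameter keeps the function testable with a stub and lets you swap in a different chain without touching the loop.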

Save the summary and report to files:

PYTHON
import os
os.makedirs("data", exist_ok=True)

with open("data/report.md", "w", encoding="utf-8") as f:
    f.write(response)

with open("data/summary.md", "w", encoding="utf-8") as f:
    f.write(summary)

Microsoft Office Files: PPT, Excel, and Word

The scripts/llm.py Helper Module

Notebooks 2 and 3 use a shared LLM helper module (scripts/llm.py) that encapsulates the Q&A chain. Here is the full module:

PYTHON
# scripts/llm.py
from langchain_ollama import ChatOllama
from langchain_core.prompts import (SystemMessagePromptTemplate,
                                    HumanMessagePromptTemplate,
                                    ChatPromptTemplate)
from langchain_core.output_parsers import StrOutputParser

base_url = "http://localhost:11434"
model = 'qwen3'
llm = ChatOllama(base_url=base_url, model=model)

system = SystemMessagePromptTemplate.from_template(
    "You are helpful AI assistant who answer user question based on the provided context."
)

prompt = """Answer user question based on the provided context ONLY! If you do not know the answer, just say "I don't know".
            ### Context:
            {context}

            ### Question:
            {question}

            ### Answer:"""

prompt = HumanMessagePromptTemplate.from_template(prompt)
template = ChatPromptTemplate([system, prompt])
qna_chain = template | llm | StrOutputParser()

def ask_llm(context, question):
    return qna_chain.invoke({'context': context, 'question': question})

Import it from notebooks with from scripts import llm, then call llm.ask_llm(context, question).


Project 1: PowerPoint Speaker Script Generator

UnstructuredPowerPointLoader extracts all text elements from a .pptx file. Setting mode="elements" returns individual text blocks (title, body, bullet) each as a separate Document, with metadata["page_number"] indicating which slide they belong to.

PYTHON
import nltk
nltk.download('punkt')
OUTPUT
True

Note

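The unstructured library requires NLTK punkt tokenizer data. nltk.download('punkt') downloads it on first use. Recent NLTK releases look for the punkt_tab resource instead, so if you see an error that mentions it (for example OSError: No such file or directory: .../punkt/PY3_tab), running nltk.download('punkt_tab') is usually the simplest fix. Alternatively, rename the downloaded PY3 folder to PY3_tab inside nltk_data/tokenizers/punkt/.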

PYTHON
from langchain_community.document_loaders import UnstructuredPowerPointLoader

loader = UnstructuredPowerPointLoader("data/ml_course.pptx", mode="elements")
docs = loader.load()

len(docs)
OUTPUT
47
PYTHON
doc = docs[0]
doc.page_content
OUTPUT
'Machine Learning Model Deployment'

Group all text elements by slide number into a dict:

PYTHON
ppt_data = {}
for doc in docs:
    page = doc.metadata["page_number"]
    ppt_data[page] = ppt_data.get(page, "") + "\n\n" + doc.page_content

The resulting ppt_data dict (abbreviated) maps each slide number to its combined text:

PYTHON
ppt_data
OUTPUT
{1: '\n\nMachine Learning Model Deployment\n\nIntroduction to ML Pipeline\n\nhttps://bit.ly/bert_nlp\n\n', 2: '\n\nWhat is Machine Learning Pipeline?\n\n', 3: '\n\nType of ML Deployment\n\nBatch: In batch deployment, ML models process large volumes of data at scheduled intervals...\nStream: Stream deployment enables ML models to process and analyze data in real-time...\nRealtime: Realtime deployment allows ML models to provide instant predictions...\nEdge: Edge deployment involves running ML models on local devices close to the data source...\n\n', 4: '\n\nInfrastructure and Integration\n\nHardware and Software: Setting up the right environment...\nIntegration: Seamlessly integrating the model with existing systems...\n\n', 5: '\n\nBenefits of Deploying ML Models\n\nFocus on new models, not maintaining existing... || Prevention of bugs || Creation of records for debugging...\n\n', 6: '\n\nChallenges in ML Deployment\n\nData Management, Model Scalability and Performance, Integration with Existing Systems...\n\n', ...}

Build a structured context string with slide headers:

PYTHON
context = ""
for page, content in ppt_data.items():
    context += f"### Slide {page}:\n\n{content.strip()}\n\n\n"
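The grouping and formatting steps above can be combined into one function. This sketch uses a defaultdict of lists, which avoids the leading blank lines produced by string concatenation, and skips elements that carry no page_number. build_slide_context is a hypothetical name, and the SimpleNamespace objects are stand-ins that only mimic the page_content and metadata attributes of LangChain Documents.

```python
from collections import defaultdict
from types import SimpleNamespace


def build_slide_context(docs):
    """Group element text by slide number and render one section per slide."""
    slides = defaultdict(list)
    for doc in docs:
        page = doc.metadata.get("page_number")  # may be missing on some elements
        if page is not None and doc.page_content.strip():
            slides[page].append(doc.page_content.strip())
    return "\n\n\n".join(
        f"### Slide {page}:\n\n" + "\n\n".join(slides[page])
        for page in sorted(slides)
    )


# Stand-in objects with the same two attributes as LangChain Documents
sample = [
    SimpleNamespace(page_content="Machine Learning Model Deployment", metadata={"page_number": 1}),
    SimpleNamespace(page_content="Type of ML Deployment", metadata={"page_number": 3}),
    SimpleNamespace(page_content="Introduction to ML Pipeline", metadata={"page_number": 1}),
]
print(build_slide_context(sample))
```

Sorting the slide numbers keeps the sections in presentation order even if the loader returns elements out of order.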

Generate a 2-minute speaker script for every slide:

PYTHON
from scripts import llm

question = """
For each PowerPoint slide provided above, write a 2-minute script that effectively conveys the key points.
Ensure a smooth flow between slides, maintaining a clear and engaging narrative.
"""

response = llm.ask_llm(context, question)

Save the script:

PYTHON
with open("data/ppt_script.md", "w", encoding="utf-8") as f:
    f.write(response)

Project 2: Excel Data Analysis

Note

LLMs are not reliable for mathematical calculations or aggregate analytics. Use them only for reading, formatting, and filtering tabular data, not for computing sums or averages.

PYTHON
from langchain_community.document_loaders import UnstructuredExcelLoader

loader = UnstructuredExcelLoader("data/sample.xlsx", mode="elements")
docs = loader.load()

doc = docs[0]
doc.page_content
OUTPUT
'First Name Last Name City Gender Brandon James Miami M Sean Hawkins Denver M Judy Day Los Angeles F Ashley Ruiz San Francisco F Stephanie Gomez Portland F'

The metadata dict contains the full HTML representation of the sheet:

PYTHON
context = doc.metadata['text_as_html']
context
HTML
<table><tr><td>First Name</td><td>Last Name</td><td>City</td><td>Gender</td></tr><tr><td>Brandon</td><td>James</td><td>Miami</td><td>M</td></tr><tr><td>Sean</td><td>Hawkins</td><td>Denver</td><td>M</td></tr><tr><td>Judy</td><td>Day</td><td>Los Angeles</td><td>F</td></tr><tr><td>Ashley</td><td>Ruiz</td><td>San Francisco</td><td>F</td></tr><tr><td>Stephanie</td><td>Gomez</td><td>Portland</td><td>F</td></tr></table>

Ask the LLM to format it as Markdown:

PYTHON
question = "Return this data in Markdown format."
response = llm.ask_llm(context, question)
print(response)
OUTPUT
| First Name | Last Name | City          | Gender |
|------------|-----------|---------------|--------|
| Brandon    | James     | Miami         | M      |
| Sean       | Hawkins   | Denver        | M      |
| Judy       | Day       | Los Angeles   | F      |
| Ashley     | Ruiz      | San Francisco | F      |
| Stephanie  | Gomez     | Portland      | F      |

Filter rows in natural language:

PYTHON
question = "Return all entries in the table where Gender is 'F'. Format the response in Markdown. Do not write preambles and explanation."
response = llm.ask_llm(context, question)
print(response)
OUTPUT
| First Name | Last Name | City          | Gender |
|------------|-----------|---------------|--------|
| Judy       | Day       | Los Angeles   | F      |
| Ashley     | Ruiz      | San Francisco | F      |
| Stephanie  | Gomez     | Portland      | F      |
PYTHON
question = "Return all entries in the table where Gender is 'male'. Format the response in Markdown. Do not write preambles and explanation."
response = llm.ask_llm(context, question)
print(response)
OUTPUT
| First Name | Last Name | City   | Gender |
|------------|-----------|--------|--------|
| Brandon    | James     | Miami  | M      |
| Sean       | Hawkins   | Denver | M      |
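Since LLMs are unreliable for counting and arithmetic, aggregates are better computed in code from the same text_as_html string. The sketch below parses a simple HTML table with the standard library and counts rows per gender. TableRows is a hypothetical helper, the table_html string is an abbreviated copy of the sheet shown earlier, and the parser assumes a flat table with no nested tables or merged cells.

```python
from collections import Counter
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in a simple HTML table."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Abbreviated version of doc.metadata['text_as_html'] from this section
table_html = ("<table><tr><td>First Name</td><td>Gender</td></tr>"
              "<tr><td>Brandon</td><td>M</td></tr><tr><td>Judy</td><td>F</td></tr>"
              "<tr><td>Ashley</td><td>F</td></tr></table>")

parser = TableRows()
parser.feed(table_html)
header, *records = parser.rows
gender_counts = Counter(row[header.index("Gender")] for row in records)
print(gender_counts)
# → Counter({'F': 2, 'M': 1})
```

For larger sheets, loading the file with a dataframe library is the more common route; the point here is only that the count comes from code, not from the model.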

Project 3: Personalized Job Application Letter

Docx2txtLoader reads .docx Word documents and returns the full text content as a single Document:

PYTHON
from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("data/job_description.docx")
docs = loader.load()

context = docs[0].page_content

Pass the job description as context and the applicant's details as the question:

PYTHON
question = """
My name is Aaditya, and I am a recent graduate from IIT with a focus on Natural Language Processing and Machine Learning.
I am applying for a Data Scientist position at SpiceJet.
Please write a concise job application email for me in short, removing any placeholders, including references to job boards or sources.
"""

response = llm.ask_llm(context, question)
print(response)
OUTPUT
Subject: Application for Data Scientist Position

Dear SpiceJet Team,

My name is Aaditya, a recent graduate from IIT with a focus on Natural Language Processing and Machine Learning. I am applying for the Data Scientist role at SpiceJet, as outlined in your job description.

With a strong foundation in data science, machine learning, and NLP, I am eager to contribute to your mission of leveraging data to drive revenue growth, reduce costs, and enhance customer experiences. My proficiency in Python, R, SQL, and tools like Tableau aligns with your requirements, and I have experience in predictive modeling and statistical analysis.

I am particularly drawn to SpiceJet's emphasis on collaboration with product teams and deploying models to automate processes...

Best regards,
Aaditya

MarkitDown — Microsoft's Universal File Converter

MarkItDown is a Microsoft open-source library that converts many file formats to Markdown, including PDF, DOCX, PPTX, XLSX, images, audio files, and YouTube video metadata. It is separate from LangChain's document loaders: convert() returns a result object whose text_content attribute holds the Markdown string, rather than a list of Document objects.

Install: pip install markitdown
GitHub: https://github.com/microsoft/markitdown

PYTHON
import warnings
warnings.filterwarnings("ignore")

from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()

file_path = Path("data/Apple-10-Q-2025-Q1.pdf")
file_path.stem
OUTPUT
'Apple-10-Q-2025-Q1'

Convert and save to a Markdown file:

PYTHON
import os
os.makedirs("markitdown", exist_ok=True)

result = md.convert(file_path)

with open(f"markitdown/{file_path.stem}.md", "w", encoding="utf-8") as f:
    f.write(result.text_content)

Preview the first 500 characters of the converted output:

PYTHON
print(result.text_content[:500])
OUTPUT
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549

FORM 10-Q

(Mark One)

☑    QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the quarterly period ended December 28, 2024
or
☐    TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from             to            .

Commission File Number: 001-36743

Apple Inc.

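The text of the SEC 10-Q filing comes through in reading order, with its line and paragraph breaks intact. As the preview shows, PDF output is mostly plain text, so expect heading levels and table structure to be only partly recovered.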

Reusable Convert-and-Save Helper

PYTHON
def convert_and_save(file_path):
    file_path = Path(file_path)
    result = md.convert(file_path)

    with open(f"markitdown/{file_path.stem}.md", "w", encoding="utf-8") as f:
        f.write(result.text_content)

Convert other file types with the same function:

PYTHON
convert_and_save("data/Apple-10-Q-2025-Q1.pdf")   # PDF
convert_and_save("data/job_description.docx")       # Word
convert_and_save("data/ml_course.pptx")             # PowerPoint
convert_and_save("data/sample.xlsx")                # Excel
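When converting many files in one run, a single unsupported or corrupt file should not stop the batch. The following sketch wraps the convert-and-save step with error handling. convert_many is a hypothetical helper, and convert is assumed to be any callable that takes a path and returns Markdown text, for example a lambda around md.convert(...).text_content.

```python
from pathlib import Path


def convert_many(paths, convert, out_dir="markitdown"):
    """Convert each path with `convert` and write <stem>.md; return (saved, failed)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    for path in map(Path, paths):
        try:
            text = convert(path)
        except Exception as exc:  # keep going if one file cannot be converted
            failed.append((path, str(exc)))
            continue
        target = out_dir / f"{path.stem}.md"
        target.write_text(text, encoding="utf-8")
        saved.append(target)
    return saved, failed


# With the MarkItDown instance from above (not run here):
# saved, failed = convert_many(
#     ["data/sample.xlsx", "data/ml_course.pptx"],
#     lambda p: md.convert(p).text_content,
# )
```

Note that two inputs with the same stem, such as report.pdf and report.docx, would write to the same output file; add the suffix to the target name if that can happen in your data.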

Converting YouTube Video Metadata

MarkItDown can also fetch YouTube video metadata (title, description, keywords) from a URL:

PYTHON
md = MarkItDown()

result = md.convert("https://www.youtube.com/watch?v=De66dBYqQWI")

print(result.text_content[:500])
OUTPUT
# YouTube

## Deploy OpenAI Agent Builder Workflow in Production at Your Website | Integrate Chatbot in Website

### Video Metadata
- **Keywords:** kgp talkie, kgp talkie videos, kgp talkie ml, openai agent builder, chatbot deployment, website chatbot integration, vercel deployment, chatkit tutorial, vector database chatbot, openai api tutorial, ai chatbot development, agent builder workflow...

Tip

Use MarkItDown as a preprocessing step before LangChain loaders: convert a complex PDF or PPTX to Markdown first, then load the Markdown file with UnstructuredMarkdownLoader, which can make chunking and retrieval more reliable.


Docling — IBM's Structured Document Converter

Docling is IBM's open-source document intelligence library that converts PDFs and other files into structured formats (Markdown, HTML, JSON) with deep understanding of layout, tables, figures, and OCR text. Unlike MarkItDown, Docling uses ML models to understand document structure and can extract tables as DataFrames and figures as images.

GitHub: https://github.com/docling-project/docling

PYTHON
from docling.document_converter import DocumentConverter
import os
from pathlib import Path

converter = DocumentConverter()

Basic Markdown Extraction

Convert the Apple 10-Q PDF:

PYTHON
file_path = Path("data/Apple-10-Q-2025-Q1.pdf")
result = converter.convert(file_path)

Docling processes the document using its ML pipeline (layout analysis, OCR via RapidOCR, table detection). Processing time varies with GPU availability — the example ran in ~17 seconds on a CUDA-enabled machine.

Export to Markdown and save:

PYTHON
os.makedirs("docling", exist_ok=True)

markdown = result.document.export_to_markdown()
with open(f"docling/{file_path.stem}.md", "w", encoding="utf-8") as f:
    f.write(markdown)

Export to HTML as well:

PYTHON
html = result.document.export_to_html()
with open(f"docling/{file_path.stem}.html", "w", encoding="utf-8") as f:
    f.write(html)

Reusable Docling Convert Helper

PYTHON
def convert_and_save(file_path):
    file_path = Path(file_path)
    result = converter.convert(file_path)

    markdown = result.document.export_to_markdown()
    with open(f"docling/{file_path.stem}.md", "w", encoding="utf-8") as f:
        f.write(markdown)

Docling handles multiple file types with format-specific pipelines:

PYTHON
convert_and_save("data/scansmpl.pdf")                        # Scanned PDF (uses OCR)
convert_and_save("data/job_description.docx")                # Word document
convert_and_save("data/sample.xlsx")                         # Excel spreadsheet
convert_and_save("data/finance and health rag system.jpg")   # Image file

Each format uses the appropriate Docling pipeline. The scanned PDF processes through OCR; DOCX uses a SimplePipeline; XLSX extracts sheet data; images run through vision-based analysis.

Extracting Tables to CSV

Docling identifies all tables in a PDF and can export each one to a pandas DataFrame, then to CSV:

PYTHON
os.makedirs("docling/tables", exist_ok=True)

converter = DocumentConverter()
file_path = Path("data/Apple-10-Q-2025-Q1.pdf")
result = converter.convert(file_path)

for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe()
    df.to_csv(f"docling/tables/{file_path.stem}_table_{i+1}.csv", index=False)

The Apple 10-Q contains financial tables (income statements, balance sheets, cash flow) — each is saved as a separate numbered CSV file.

Note

table.export_to_dataframe() without a doc argument is deprecated in newer Docling versions. Pass table.export_to_dataframe(doc=result.document) to suppress the deprecation message.
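If the same notebook has to run against different Docling releases, one option is to try the newer call first and fall back to the older one. This is a sketch under the assumption that releases without the doc parameter raise TypeError on the keyword argument; table_to_dataframe is a hypothetical helper, not a Docling function.

```python
def table_to_dataframe(table, document):
    """Prefer export_to_dataframe(doc=...); fall back if the argument is unknown."""
    try:
        return table.export_to_dataframe(doc=document)
    except TypeError:
        # Assumed older signature without a `doc` parameter
        return table.export_to_dataframe()


# In the export loop from this section (not run here):
# for i, table in enumerate(result.document.tables):
#     df = table_to_dataframe(table, result.document)
```

The try/except also catches a TypeError raised inside the export itself, so pinning a Docling version is the more robust choice when you control the environment.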

Extracting Figures as Images

Enable figure extraction by configuring PdfPipelineOptions before converting:

PYTHON
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

os.makedirs("docling/figures", exist_ok=True)

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True
pipeline_options.images_scale = 3.0

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

images_scale = 3.0 renders images at 3× resolution for high-quality extraction. Convert a textbook PDF and save all figures:

PYTHON
file_path = Path("data/sample_textbook.pdf")
result = converter.convert(file_path)

for i, picture in enumerate(result.document.pictures):
    image = picture.get_image(result.document)
    image.save(f"docling/figures/{file_path.stem}_figure_{i+1}.png")

Each detected figure or chart from the PDF is saved as a separate numbered PNG file.


Loader Comparison

| Loader | Library | Best For | Output |
|---|---|---|---|
| PyMuPDFLoader | langchain-community (uses pymupdf) | Digital PDFs: fast, metadata-rich | list[Document] |
| WebBaseLoader | langchain-community | Scraping multiple webpages | list[Document] |
| UnstructuredPowerPointLoader | langchain-community (uses unstructured) | PPTX slide content by page | list[Document] |
| UnstructuredExcelLoader | langchain-community (uses unstructured) | XLSX with HTML table in metadata | list[Document] |
| Docx2txtLoader | langchain-community (uses docx2txt) | Word .docx plain text | list[Document] |
| MarkItDown | markitdown | Many file types to Markdown (PDF, DOCX, PPTX, XLSX, images, YouTube) | str (result.text_content) |
| Docling | docling | Structured extraction: tables as CSV, figures as PNG, Markdown/HTML/JSON | ConversionResult |

What You Built

This lesson covered the complete document loading layer of a RAG or document-processing pipeline:

  • PDF loaders — load one PDF or an entire folder, count tokens with tiktoken, and wire to Q&A, summarization, and report chains
  • Webpage loaders — scrape multiple URLs asynchronously, clean raw text with regex, chunk for large contexts, and summarize market news per chunk
  • Office loaders — extract slide text for speaker scripts (PPTX), query tabular data in natural language (XLSX), and generate personalized emails from job descriptions (DOCX)
  • MarkitDown — Microsoft's single-function converter that turns any file type into clean Markdown, including YouTube video metadata
  • Docling — IBM's ML-powered extractor that produces structured Markdown, HTML, per-table CSVs, and per-figure PNGs from complex PDFs

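The LangChain loaders return lists of Document objects that format_docs() joins into a context string for any LCEL chain; MarkItDown and Docling return Markdown text that can be passed as context directly. The next step is chunking and embedding that content into a vector store for retrieval-augmented generation.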
