#langchain #resume-parsing #pdf-extraction #pymupdf #json-output-parser #llm #qwen3 #ollama #python

Resume Parsing with LangChain and LLMs

Extract structured data from PDF resumes using PyMuPDF, process the text with an LLM, and parse the result into valid JSON with LangChain's JsonOutputParser in a two-stage extraction and validation pipeline.

Jun 4, 2026 at 10:30 AM · 5 min read

Topics You Will Master

  • Loading and extracting text from PDF resumes using PyMuPDFLoader
  • Designing a highly structured extraction prompt with specific schema requirements
  • Building a modular LLM pipeline using ChatOllama and LCEL (|)
  • Implementing a two-stage LLM pipeline:
    • Stage 1: extraction with StrOutputParser
    • Stage 2: validation with JsonOutputParser
  • Saving the final parsed resume data as a clean, machine-readable JSON file

Best For

Developers building HR tech, applicant tracking systems (ATS), or automated screening tools who need to convert unstructured PDF resumes into structured JSON data.

Expected Outcome

A modular Python script (llm.py) and a Jupyter Notebook that together load a PDF resume, extract contact info, education, experience, and skills, and save the result as a strictly validated JSON file.

Resume parsing has long been difficult because every candidate uses a different layout, and regex or rule-based parsers break as soon as the layout changes. LLMs handle this better because they read the document semantically rather than by position. In this lesson, we extract raw text from a PDF resume using PyMuPDFLoader and run it through a two-stage LLM pipeline (extraction, then validation) so that the final output parses as valid JSON.

Prerequisites: langchain, langchain-community (provides PyMuPDFLoader), langchain-ollama, langchain-core, pymupdf, python-dotenv installed. Ollama running locally with the qwen3 model pulled (ollama pull qwen3). A sample resume PDF in a resume/ directory.

BASH
pip install -U langchain langchain-community langchain-ollama langchain-core pymupdf python-dotenv

The scripts/llm.py Module

Instead of writing all the LLM logic in a notebook, we modularize it into a reusable llm.py script. This makes it easy to reuse the same logic in a web app later.

1. Model Setup

PYTHON
from langchain_ollama import ChatOllama
from langchain_core.prompts import (SystemMessagePromptTemplate,
                                    HumanMessagePromptTemplate,
                                    ChatPromptTemplate)
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser

base_url = "http://localhost:11434"
model = 'qwen3'

llm = ChatOllama(base_url=base_url, model=model)

system = SystemMessagePromptTemplate.from_template(
    """You are a helpful AI assistant who answers user questions based on the provided context."""
)
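
The base URL and model name above are hard-coded. Since the notebook already loads a .env file, one option is to read these two values from the environment instead. The following is a minimal sketch under that assumption; the variable names OLLAMA_BASE_URL and OLLAMA_MODEL and the helper ollama_settings are hypothetical and not part of the original module.

```python
import os


def ollama_settings(env=None):
    """Return (base_url, model), falling back to the defaults used in llm.py.

    OLLAMA_BASE_URL and OLLAMA_MODEL are hypothetical variable names
    chosen for this sketch.
    """
    env = os.environ if env is None else env
    base_url = env.get("OLLAMA_BASE_URL", "http://localhost:11434")
    model = env.get("OLLAMA_MODEL", "qwen3")
    return base_url, model


# With nothing set, the tutorial's defaults are returned.
print(ollama_settings({}))
```

The two returned values could then be passed to ChatOllama(base_url=..., model=...) exactly as in the setup code.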

2. The Extraction Function (ask_llm)

The first pass extracts information into a structured text format. Notice how specific the prompt is about the required sections:

PYTHON
prompt = """
            **Task:** Extract key information from the following resume text.

            **Resume Text:**
            {context}

            **Instructions:**
            Please extract the following information and format it in a clear structure:

            1. **Contact Information:**
            - Name:
            - Email:
            - Phone Number:
            - Website/Portfolio:

            2. **Education:**
            - Institution Name:
            - Degree:
            - Field of Study:
            - Graduation Dates:

            3. **Experience:**
            - Job Title:
            - Company Name:
            - Location:
            - Dates of Employment:
            - Responsibilities/Projects:

            4. **Projects:**
            - Project Title:
            - Description/Technologies Used:
            - Outcomes/Results:

            5. **Skills:**
            - Programming Languages:
            - Technologies/Tools:

            6. **Additional Information:** (if applicable)
            - Certifications:
            - Awards or Honors:
            - Professional Affiliations:
            - Languages:

            **Question:**
            {question}

            **Extracted Information:**
        """

prompt = HumanMessagePromptTemplate.from_template(prompt)

def ask_llm(context, question):
    messages = [system, prompt]
    template = ChatPromptTemplate(messages)

    # Use StrOutputParser for the initial extraction pass
    qna_chain = template | llm | StrOutputParser()
    return qna_chain.invoke({'context': context, 'question': question})
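
One detail to watch if you extend the prompt: {context} and {question} are placeholders, so any literal curly braces you add (for example a sample JSON shape) must be doubled. LangChain's default template format follows the same placeholder rules as Python's str.format, so the behavior can be sketched with plain strings. The template text below is illustrative only and is not the prompt used in this lesson.

```python
# Literal braces are doubled; {context} remains a placeholder.
template = (
    "Return the result in this shape:\n"
    '{{"Name": "", "Email": ""}}\n\n'
    "Resume Text:\n{context}"
)

rendered = template.format(context="Jane Doe, jane@example.com")
print(rendered)
```

Leaving the braces single would make the formatter treat "Name" as a missing variable and fail when the template is rendered.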

3. The Validation Function (validate_json)

Even when instructed to return JSON, LLMs sometimes include conversational preambles (e.g., "Here is the JSON you requested:") or markdown backticks. The second pass asks the model to clean up the first pass's output and pipes the reply through JsonOutputParser, which parses it into a Python dictionary and raises an exception if the reply is not valid JSON.

PYTHON
def validate_json(data):
    json_prompt = """
            Please validate and correct the following JSON data:

            **Extracted Information:**
            {data}

            Provide only the corrected JSON, with no preamble or explanation.

            **Corrected JSON:**"""

    json_prompt = HumanMessagePromptTemplate.from_template(json_prompt)
    json_messages = [system, json_prompt]
    json_template = ChatPromptTemplate(json_messages)

    # JsonOutputParser parses the reply into a dict and raises if it is not valid JSON
    json_chain = json_template | llm | JsonOutputParser()
    return json_chain.invoke({'data': data})
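
To see why a plain json.loads call is not enough for chatty model output, here is a rough standard-library sketch of the kind of cleanup a tolerant parser has to perform: drop the preamble, unwrap a markdown fence, and parse what is left. The function extract_json is hypothetical and is not LangChain's implementation.

```python
import json
import re

FENCE = "`" * 3  # markdown code fence, built this way to keep the source readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text):
    """Best-effort parse of a single JSON object embedded in LLM output."""
    # Prefer the body of a fenced block if one is present.
    fenced = FENCED_BLOCK.search(text)
    candidate = fenced.group(1) if fenced else text
    # Keep only the outermost {...} span, discarding any preamble or trailer.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate[start:end + 1])


raw = (
    "Here is the JSON you requested:\n"
    + FENCE + "json\n"
    + '{"Name": "Jane Doe", "Skills": ["Python"]}\n'
    + FENCE
)
parsed = extract_json(raw)
print(parsed["Skills"])
```

If the remaining text is still malformed, json.loads raises, which is the same situation in which the parser in validate_json would raise.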

Running the Extraction in a Notebook

With the LLM logic encapsulated in scripts/llm.py, the notebook focuses strictly on document loading and execution.

Loading the PDF

PYTHON
from dotenv import load_dotenv
load_dotenv('./../.env')

from langchain_community.document_loaders import PyMuPDFLoader

filename = 'resume-1.pdf'

loader = PyMuPDFLoader('resume/{}'.format(filename))
docs = loader.load()

PyMuPDFLoader extracts the text of each page of the PDF. The resulting docs list contains one Document object per page, so docs[0] below holds the first page only.

PYTHON
context = docs[0].page_content

question = """You are tasked with parsing a job resume. Your goal is to extract relevant information in a valid structured 'JSON' format.
                Do not write preambles or explanations."""
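
The context above uses only docs[0], which is enough for a one-page resume. For longer resumes, one possible approach is to concatenate the text of every page before sending it to the model. The sketch below assumes only that each item has a page_content attribute; the Page class is a hypothetical stand-in so the example runs without a PDF, and join_pages is not part of the original notebook.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only page_content is used."""
    page_content: str


def join_pages(pages, separator="\n\n"):
    """Concatenate the text of every page, skipping pages with no text."""
    parts = [page.page_content.strip() for page in pages]
    return separator.join(part for part in parts if part)


demo_docs = [
    Page("Jane Doe\nSoftware Engineer"),
    Page("   "),  # e.g. a page with no extractable text
    Page("Skills: Python, SQL"),
]
print(join_pages(demo_docs))
```

In the notebook this would become context = join_pages(docs), at the cost of a longer prompt.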

Executing the Two-Stage Pipeline

Import the functions from our custom module and run them sequentially:

PYTHON
from scripts.llm import ask_llm, validate_json

# Stage 1: Semantic Extraction (returns string)
response = ask_llm(context=context, question=question)

# Stage 2: JSON Validation (returns dict)
response = validate_json(response)
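
Stage 2 can still fail if the model returns text that does not parse. A small retry wrapper is one way to handle that. The sketch below is an assumption-laden illustration: call_with_retries is a hypothetical helper, and it catches ValueError, which covers json.JSONDecodeError. LangChain's parser exception is, to my knowledge, also a ValueError subclass, but verify that against your installed version.

```python
import json
import time


def call_with_retries(fn, attempts=3, delay=0.0):
    """Call fn until it returns without raising ValueError, up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            last_error = err
            time.sleep(delay)
    raise RuntimeError(f"no valid JSON after {attempts} attempts") from last_error


# Stub standing in for the second stage: fails once, then succeeds.
replies = iter(["Sure! Here you go: {oops", '{"Name": "Jane Doe"}'])


def flaky_stage_two():
    return json.loads(next(replies))


result = call_with_retries(flaky_stage_two)
print(result)
```

In the notebook, the wrapped call might look like call_with_retries(lambda: validate_json(stage_one_text)), where stage_one_text is the string returned by ask_llm.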

Inspecting the Parsed Data

PYTHON
print(response)
OUTPUT
{
    'Contact Information': {
        'Name': 'Kumar Pallav',
        'Email': 'me@kumarpallav.com',
        'Phone Number': '+1-206-910-0006',
        'Website/Portfolio': 'http://kumarpallav.com'
    },
    'Education': {
        'Institution Name': 'Indian Institute of Technology, Bombay',
        'Degree': 'Bachelor of Computer Science and Engineering (with Hons.)',
        'Field of Study': 'Computer Science and Engineering',
        'Graduation Dates': 'Jun 2010 - May 2014'
    },
    'Experience': [
        {
            'Job Title': 'Software Engineer · OneNote',
            'Company Name': 'Microsoft',
            'Location': 'Redmond, WA',
            'Dates of Employment': 'Jun 2016 - Present',
            'Responsibilities/Projects': [
                'Magic Ink and Ink Lookup: Recognizing ink strokes into words...',
                'Whiteboard App: Shared session via OneDrive for Business...'
            ]
        },
        ...
    ],
    'Projects': [...],
    'Skills': {
        'Programming Languages': ['C++', 'CSharp', 'JavaScript', 'Java', 'C'],
        'Technologies/Tools': ['NodeJs', 'UWP', 'Win32']
    }
}

The result is a Python dictionary whose top-level keys match the categories requested in the ask_llm prompt. Repeated items, such as experience entries and individual skills, are returned as lists.
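
Valid JSON is not the same as complete JSON, so it can be worth checking that the expected sections are present before saving. The following is a minimal sketch; missing_sections and REQUIRED_SECTIONS are hypothetical names, and the key spellings are taken from the extraction prompt on the assumption that the model reuses them unchanged.

```python
# Section names copied from the extraction prompt; the model may rename them.
REQUIRED_SECTIONS = ("Contact Information", "Education", "Experience", "Skills")


def missing_sections(parsed, required=REQUIRED_SECTIONS):
    """Return the required top-level keys that are absent or empty."""
    return [key for key in required if not parsed.get(key)]


sample = {
    "Contact Information": {"Name": "Jane Doe"},
    "Education": {},
    "Skills": {"Programming Languages": ["Python"]},
}
print(missing_sections(sample))  # ['Education', 'Experience']
```

A non-empty result is a signal to re-run the extraction or flag the resume for manual review.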

Saving to JSON

Finally, save the dictionary directly to a .json file for downstream use:

PYTHON
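import json
import os

output_file = filename.replace('.pdf', '.json')
output_file = 'parsed_resume/{}'.format(output_file)

# Create the output directory if needed and close the file when done
os.makedirs('parsed_resume', exist_ok=True)
with open(output_file, 'w') as f:
    json.dump(response, f, indent=4)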

Why Use a Two-Stage Pipeline?

You might wonder why we don't just use JsonOutputParser on the first call.

  1. Cognitive Load: Extracting complex data from dense, unstructured text is hard. Formatting that data perfectly as JSON at the same time increases the risk of hallucination or syntax errors.
  2. Separation of Concerns: Pass 1 focuses entirely on reading and extracting the data accurately. Pass 2 focuses entirely on formatting and escaping the data into JSON syntax.
  3. Reliability: Because the second pass only has to reformat text that is already organized, its output is less likely to fail JSON parsing than the output of a single combined call.

What You Built

In this lesson you built a working resume parsing pipeline:

  • PDF Loading — Used PyMuPDFLoader to extract raw text from PDF resumes
  • Modular Pipeline — Encapsulated LangChain logic inside scripts/llm.py
  • Two-Stage Processing:
    • Stage 1: Extracted a highly structured textual summary using StrOutputParser
    • Stage 2: Validated and converted the text into a strict dictionary using JsonOutputParser
  • Data Persistence — Automatically saved the structured dictionary as a .json file

In the next lesson, we will deploy this exact pipeline as an interactive web application.
