LinkedIn Profile Scraping with LLM

Scrape a LinkedIn profile with Selenium and BeautifulSoup, clean and deduplicate raw HTML sections, then use a two-pass LLM pipeline to extract and structure profile data as JSON.

Jun 4, 2026 at 10:30 AM · 10 min read

Topics You Will Master

Automating LinkedIn login with Selenium WebDriver and environment variable credentials
Extracting the rendered HTML of a LinkedIn profile page with page_source
Parsing the DOM with BeautifulSoup to find all artdeco-card section elements
Cleaning raw web text with regex — removing excess newlines, tabs, and whitespace
Detecting and removing LinkedIn's UI duplication pattern with a remove_duplicates() function
Extracting section keys dynamically from the first line of each section's text
Building a first LLM pass — extracting up to 5 bullet points per section from raw text
Building a second LLM pass — re-parsing the section-level JSON dict into a clean structured profile schema
Saving the raw section data and the final structured profile to a JSON file

Best For

Developers who want to automate LinkedIn profile extraction for recruiting pipelines, competitive analysis, or portfolio aggregation — without relying on the LinkedIn API.

Expected Outcome

A linkedin_profile_data.json file holding the section-level extractions, and a structured profile (Name, Headline, About, Experience, Education, Skills, Projects, Summary) produced from a live LinkedIn profile by a two-pass LLM pipeline.

LinkedIn does not provide a public API for profile data. This lesson builds a scraper using Selenium (for browser automation and JavaScript rendering) and BeautifulSoup (for HTML parsing), then uses a two-stage LLM pipeline to extract and structure the raw text into a clean JSON profile.

Prerequisites: selenium, beautifulsoup4, lxml, langchain-ollama, langchain-core, python-dotenv installed. Chrome browser and ChromeDriver matching your Chrome version. EMAIL and PASSWORD set in .env.

BASH
pip install selenium beautifulsoup4 lxml langchain-ollama langchain-core python-dotenv

Important

Scraping LinkedIn violates their Terms of Service. Use this only for educational purposes with your own account and profile. LinkedIn actively detects automated access and may restrict or ban accounts.

Setup

PYTHON
import warnings
warnings.filterwarnings("ignore")

import os
from dotenv import load_dotenv
load_dotenv()
OUTPUT
True
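
load_dotenv() returning True only says that a .env file was found, not that it contains the expected keys. A missing variable would otherwise surface later, at the login step. A minimal sketch of an up-front check, assuming the EMAIL and PASSWORD names from the prerequisites; require_env is a hypothetical helper, not part of python-dotenv:

```python
import os

def require_env(names, env=None):
    """Return {name: value} for the requested variables, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}

# Demo with an explicit mapping instead of the real environment.
creds = require_env(["EMAIL", "PASSWORD"],
                    env={"EMAIL": "me@example.com", "PASSWORD": "secret"})
print(sorted(creds))
```

In the lesson's flow, calling require_env(["EMAIL", "PASSWORD"]) right after load_dotenv() would stop the script before Chrome is even launched.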

Selenium Browser Automation

Launching Chrome

PYTHON
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

This opens a Chrome browser window controlled by Selenium.

PYTHON
driver.get('https://www.linkedin.com/login')
driver.title
OUTPUT
'LinkedIn Login, Sign in | LinkedIn'

Logging In with Credentials from .env

Credentials are stored in .env (never hardcoded):

PYTHON
email = driver.find_element(By.ID, 'username')
email.send_keys(os.getenv('EMAIL'))

password = driver.find_element(By.ID, 'password')
password.send_keys(os.getenv('PASSWORD'))

password.submit()

After submit(), LinkedIn processes the login. If 2FA is enabled on your account, you may need to handle it manually in the browser window before the next step.
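
Because the login (and any manual 2FA step) takes an unpredictable amount of time, it helps to wait for a signal before requesting the profile URL. Below is a minimal polling sketch. wait_until is a hypothetical helper, and the "/feed" check in the comment assumes LinkedIn redirects to the feed after a successful login, which may not hold for every account. Selenium's own WebDriverWait offers the same kind of wait.

```python
import time

def wait_until(predicate, timeout=120.0, poll=1.0):
    """Call predicate() every `poll` seconds until it is truthy or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)

# With the Selenium driver from above (assumption: login lands on a /feed URL):
#   logged_in = wait_until(lambda: "/feed" in driver.current_url)

# Stand-alone demo: the condition becomes true on the third check.
checks = iter([False, False, True])
print(wait_until(lambda: next(checks), timeout=5, poll=0.01))
```

A generous timeout leaves room to type a 2FA code by hand; a False return is a cue to stop rather than scrape the login page.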

PYTHON
# Use only the canonical /in/<username> URL format; other LinkedIn URLs can land on unexpected pages

url = "https://www.linkedin.com/in/laxmimerit"
driver.get(url)

Note

Always use the canonical /in/username URL format. Using other LinkedIn URLs (e.g., search pages or redirect URLs) can cause the scraper to land on unexpected pages.
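
Profile links copied from a browser or a share button often carry a trailing slash or query parameters. One way to reduce them to the canonical form is sketched here with the standard library; canonical_profile_url is a hypothetical helper and the query string in the demo is made up:

```python
from urllib.parse import urlsplit

def canonical_profile_url(url):
    """Reduce a LinkedIn profile link to https://www.linkedin.com/in/<username>."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "in":
        raise ValueError(f"Not a /in/<username> profile URL: {url}")
    return f"https://www.linkedin.com/in/{parts[1]}"

print(canonical_profile_url("https://www.linkedin.com/in/laxmimerit/?utm_source=share"))
# https://www.linkedin.com/in/laxmimerit
```

Search pages and feed URLs raise ValueError instead of being passed to driver.get().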


HTML Extraction and Parsing

Getting the Page Source

PYTHON
page_source = driver.page_source

page_source captures the fully rendered DOM — including JavaScript-rendered content that would be missing from a plain HTTP request.
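
Since every run against the live site needs a fresh login, it can be convenient to save the rendered HTML once and develop the parsing steps against the file. A small sketch; the helper names and the file name profile_snapshot.html are hypothetical:

```python
from pathlib import Path
import tempfile

def save_snapshot(html, path):
    """Write rendered HTML to disk as UTF-8 and return the path."""
    path = Path(path)
    path.write_text(html, encoding="utf-8")
    return path

def load_snapshot(path):
    """Read a previously saved snapshot back into a string."""
    return Path(path).read_text(encoding="utf-8")

# With Selenium this would be: save_snapshot(driver.page_source, "profile_snapshot.html")
# Stand-alone demo using a temporary directory and a stub page.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_snapshot("<main><section>About</section></main>",
                          Path(tmp) / "profile_snapshot.html")
    restored = load_snapshot(saved)

print(restored)
```

The string returned by load_snapshot() can be passed to BeautifulSoup exactly like page_source.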

Parsing with BeautifulSoup

PYTHON
soup = BeautifulSoup(page_source, 'lxml')

Finding the Profile Main Container

LinkedIn's profile content is inside a <main> tag with a specific CSS class:

PYTHON
profile = soup.find('main', {'class': 'IDbhLWzXdzKoCEksNaayQTAEeGRjvNDI'})

Warning

LinkedIn frequently changes its CSS class names. The class IDbhLWzXdzKoCEksNaayQTAEeGRjvNDI was correct at the time of recording. If profile returns None, inspect the page source to find the current main container class.
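
If the class lookup returns None, the next call on profile fails with an AttributeError that says little about the cause. A hedged sketch of a more forgiving lookup: find_profile_container is a hypothetical helper, and falling back to the first <main> tag assumes the page has a single <main> element, which is worth verifying in the page source.

```python
def find_profile_container(soup, class_name=None):
    """Look up <main> by class first, then fall back to the first <main> tag.

    `soup` is expected to behave like a BeautifulSoup object (only .find is used).
    """
    container = soup.find("main", {"class": class_name}) if class_name else None
    if container is None:
        container = soup.find("main")  # assumption: the page has one <main> element
    if container is None:
        raise LookupError(
            "No <main> element found; the page may not have loaded or the login may have failed"
        )
    return container

# Usage with the objects from this lesson:
#   profile = find_profile_container(soup, "IDbhLWzXdzKoCEksNaayQTAEeGRjvNDI")
```

The explicit LookupError points at the likely causes (stale class name, failed login) instead of a bare NoneType error further down.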

Extracting Profile Sections

LinkedIn structures each profile section (About, Experience, Education, etc.) as <section class="artdeco-card"> elements:

PYTHON
sections = profile.find_all('section', {'class': 'artdeco-card'})
len(sections)
OUTPUT
20

20 sections found on the profile page.

Converting Sections to Text

PYTHON
sections_text = [section.get_text() for section in sections]

Text Cleaning

Removing Excess Whitespace

Raw LinkedIn HTML text contains excessive newlines, tabs, and mixed whitespace from the rendered DOM:

PYTHON
import re

def clean_text(text):
    text = re.sub(r'\n+', '\n', text)       # collapse multiple newlines into one
    text = re.sub(r'\t+', '\t', text)       # collapse multiple tabs into one
    text = re.sub(r'\t\s+', ' ', text)      # replace tab+spaces with a single space
    text = re.sub(r'\n\s+', '\n', text)     # remove leading spaces after newlines
    return text

sections_text = [clean_text(section) for section in sections_text]
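
To see what the four substitutions do together, here is the same function applied to a made-up snippet (the sample string is illustrative, not real LinkedIn output):

```python
import re

def clean_text(text):
    text = re.sub(r'\n+', '\n', text)       # collapse multiple newlines into one
    text = re.sub(r'\t+', '\t', text)       # collapse multiple tabs into one
    text = re.sub(r'\t\s+', ' ', text)      # tab followed by whitespace -> single space
    text = re.sub(r'\n\s+', '\n', text)     # drop whitespace at the start of a line
    return text

raw = "Experience\n\n\n    Senior Manager\t\t  Linedata\n\t\tSep 2024"
cleaned = clean_text(raw)
print(repr(cleaned))
# 'Experience\nSenior Manager Linedata\nSep 2024'
```

Note that a single tab between two words with no whitespace after it is left in place; only a tab followed by more whitespace is turned into a space.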

Removing LinkedIn UI Duplication

LinkedIn duplicates certain strings in its UI (e.g., section headers appear twice in the raw text). The remove_duplicates() function detects this pattern — if the first half of a line equals the second half, keep only the first half:

PYTHON
def remove_duplicates(text):
    lines = text.split('\n')
    new_lines = []
    for line in lines:
        if line[:len(line)//2] == line[len(line)//2:]:
            new_lines.append(line[:len(line)//2])
        else:
            new_lines.append(line)
    return '\n'.join(new_lines)

sections_text = [remove_duplicates(section) for section in sections_text]
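
The half-equals-half check also fires on short lines that merely look repeated (for example "haha" would become "ha"). The sketch below is a variant with a length guard. The min_half parameter is a hypothetical addition, not part of the original function, and the sample text is made up:

```python
def remove_duplicates(text, min_half=3):
    """Collapse lines whose two halves are identical, e.g. 'SkillsSkills' -> 'Skills'.

    min_half is a hypothetical guard: halves shorter than this are left alone.
    """
    new_lines = []
    for line in text.split('\n'):
        half = len(line) // 2
        if len(line) % 2 == 0 and half >= min_half and line[:half] == line[half:]:
            new_lines.append(line[:half])
        else:
            new_lines.append(line)
    return '\n'.join(new_lines)

sample = "ExperienceExperience\nSenior Manager\nhaha"
print(remove_duplicates(sample))
```

With min_half=0 the function behaves like the original version, so the guard can be tuned or switched off.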

Sample cleaned section output:

PYTHON
print(sections_text[1])
OUTPUT
Open to work
Chief Data Scientist, Head of Data Science, Data Science Vice President, Lead Data Scientist and Data Scientist roles
Show details
Edit

Sample Raw Section Data

sections_text[0] contains the full profile header:

PLAINTEXT
Laxmi Kant has a {:badgeType} account
Laxmi Kant Tiwari
Gen AI in Finance & Investment Services | Data Scientist | IIT Kharagpur | Asset Management | AI-Driven Financial Modeling | Search Ranking | NLP Python BERT AWS Elasticsearch GNN SQL LLM | AI in Investment Strategies
Indian Institute of Technology, Kharagpur
Mumbai, Maharashtra, India
Contact info
27,018 followers
500+ connections
Open to
Add profile section
...

First LLM Pass — Section-Level Extraction

LLM and Chain Setup

PYTHON
from langchain_ollama import ChatOllama
from langchain_core.prompts import (SystemMessagePromptTemplate,
                                    HumanMessagePromptTemplate,
                                    ChatPromptTemplate)
from langchain_core.output_parsers import StrOutputParser

base_url = "http://localhost:11434"
model = 'qwen3'

llm = ChatOllama(base_url=base_url, model=model)

system = SystemMessagePromptTemplate.from_template(
    """You are a helpful AI assistant who answers questions about parsing LinkedIn
    profiles, based on the provided profile text data."""
)

def ask_llm(prompt):
    prompt = HumanMessagePromptTemplate.from_template(prompt)
    messages = [system, prompt]
    template = ChatPromptTemplate(messages)
    qna_chain = template | llm | StrOutputParser()
    return qna_chain.invoke({})

Test the connection:

PYTHON
llm
OUTPUT
ChatOllama(model='qwen3', base_url='http://localhost:11434')
PYTHON
ask_llm("hello")
OUTPUT
'Hello! How can I assist you with LinkedIn profile parsing today? Are you looking to extract specific information, automate data collection, or need guidance on best practices? Let me know your query! 😊'

Extraction Prompt Template

The template injects the section's raw text and the key to extract, then asks for up to 5 bullet points:

PYTHON
template = """
Extract and return the requested information from the LinkedIn profile data in a concise, point-by-point format (up to 5 points). Avoid preambles or any additional context.

### LinkedIn Profile Data:
{}

### Information to Extract:
Extract '{}' in bullet points, limiting the output to 5 points. Provide only the necessary details.
Remember, It is LinkedIn profile data.

### Extracted Data:"""

Test on the first section:

PYTHON
context = sections_text[0]
k = "Name and Headline"

prompt = template.format(context, k).replace('{', '{{').replace('}', '}}')
response = ask_llm(prompt)
print(response)
OUTPUT
- Name: Laxmi Kant Tiwari
- Headline: Gen AI in Finance & Investment Services | Data Scientist
- Education: Indian Institute of Technology, Kharagpur
- Skills: NLP, Python, BERT, AWS, Elasticsearch, GNN, SQL, LLM
- Focus Area: AI in Investment Strategies

Note

The .replace('{', '{{').replace('}', '}}') is necessary because ChatPromptTemplate uses curly-brace syntax. Escaping the braces prevents LangChain from treating the section text content as template variables.
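
The effect of the escaping can be reproduced with plain str.format, which follows the same doubled-brace rule. That LangChain's default template format behaves the same way is an assumption worth checking against the installed version. The sample line is the first line of the profile header shown earlier:

```python
raw = "Laxmi Kant has a {:badgeType} account"

# Unescaped braces are parsed as a replacement field, so formatting fails.
try:
    raw.format()
    unescaped_ok = True
except (IndexError, KeyError, ValueError):
    unescaped_ok = False

# Doubling the braces marks them as literal characters.
escaped = raw.replace('{', '{{').replace('}', '}}')
restored = escaped.format()

print(unescaped_ok)
print(restored == raw)
```

The escaped string formats back to the original text, which is what the model should see.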

Extracting Section Keys Automatically

Instead of hardcoding section names, extract them from the first line of each section:

PYTHON
section_keys = ['Name and Headline']
for section in sections_text[1:]:
    section_keys.append(section.strip().split('\n')[0])

section_keys
OUTPUT
['Name and Headline', 'Open to work', "Tell non-profits you're interested in getting involved with your time and skills", 'Suggested for you', 'Analytics', 'About', 'Featured', 'Activity', 'Experience', 'Education', 'Licenses & certifications', 'Projects', 'Skills', 'Recommendations', 'Patents', 'Courses', 'Honors & awards', 'Languages', 'Interests', 'Causes']
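
Several of these keys are LinkedIn UI panels rather than profile content. If sending them to the model is not wanted, the pairs could be filtered first. A sketch under that assumption: SKIP_KEYS and select_sections are hypothetical, and which headings count as noise is a judgment call.

```python
# Hypothetical skip list: headings that are LinkedIn UI panels rather than profile content.
SKIP_KEYS = {
    "Suggested for you",
    "Analytics",
    "Tell non-profits you're interested in getting involved with your time and skills",
}

def select_sections(keys, texts, skip=SKIP_KEYS):
    """Return (key, text) pairs, dropping skipped headings and empty sections."""
    return [(k, t) for k, t in zip(keys, texts) if k not in skip and t.strip()]

keys = ["Name and Headline", "Suggested for you", "About"]
texts = [
    "Laxmi Kant Tiwari\nMumbai, Maharashtra, India",
    "Suggested for you\nAdd profile section",
    "About\nDemonstrated 8+ years of expertise",
]
for key, _ in select_sections(keys, texts):
    print(key)
```

Fewer sections also means fewer LLM calls in the first pass.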

Processing All 20 Sections

PYTHON
responses = {}

for k, context in zip(section_keys, sections_text):
    prompt = template.format(context, k).replace('{', '{{').replace('}', '}}')
    response = ask_llm(prompt=prompt)
    responses[k] = response

The responses dict maps each section key to its extracted bullet points. Abbreviated output:

PYTHON
print(responses)
OUTPUT
{'Name and Headline': '- Name: Laxmi Kant Tiwari\n- Headline: Data Scientist | Gen AI in Finance & Investment Services\n- Headline: Asset Management | AI-Driven Financial Modeling\n- Headline: AI in Investment Strategies\n- Headline: NLP Python BERT AWS Elasticsearch GNN SQL LLM', 'Open to work': '- Open to work', 'Analytics': '- 418 profile views\n- 2,950 post impressions\n- 129 search appearances\n- Past 7 days\n- Show all analytics', 'About': '- Demonstrated 8+ years of expertise in Advanced Analytics and Machine Learning, leading strategic AI initiatives at Linedata.\n- Specializes in designing scalable GenAI/LLM applications for financial workflows...\n- Developed AI-powered reconciliation system for cash/position breaks...\n- Built aliasing framework for securities/accounts mapping...\n- Created LLM-driven financial memo generator and research assistants...', 'Experience': '- **Senior Manager, Linedata** (Sep 2024 – Present)\n  Developed AI-powered financial solutions using LLMs, custom algorithms...\n- **Assistant Vice President, IGP** (Oct 2023 – Sep 2024)\n  Led projects including customer behavior modeling...\n- **Data Science Manager, IGP.COM** (Dec 2019 – Apr 2021)\n  Implemented AI-powered search with BERT algorithms, boosting search conversion by 20% and revenue by 25%...', 'Education': '- Senior Research Scholar, Computer Science, IIT Kharagpur (2014–2016)\n- M.Tech., Computer Science, IIT Kharagpur (2012–2014)\n- Signal processing algorithms for respiration and heart rate monitoring\n- Machine Learning algorithms and Android app for Sleep Apnea detection\n- 5-stage pipelined RISC processor in Verilog (term project)', ...}
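
With 20 sequential model calls, a single failure would abort the loop above and discard the earlier results. One possible hardening is sketched below. extract_all is a hypothetical wrapper that takes the asking function as a parameter, so the demo can run with a stub instead of ask_llm:

```python
def extract_all(section_keys, sections_text, ask, template):
    """Run `ask` once per section; record failures instead of aborting the whole loop."""
    responses, errors = {}, {}
    for k, context in zip(section_keys, sections_text):
        prompt = template.format(context, k).replace('{', '{{').replace('}', '}}')
        try:
            responses[k] = ask(prompt)
        except Exception as exc:  # e.g. the model server is unreachable or times out
            errors[k] = str(exc)
    return responses, errors

# Stand-alone demo with a stub in place of ask_llm.
def fake_ask(prompt):
    if "Broken" in prompt:
        raise RuntimeError("model unavailable")
    return "- stub answer"

demo_template = "Data:\n{}\nExtract '{}'."
responses, errors = extract_all(
    ["About", "Broken"], ["About\ntext", "Broken\ntext"], fake_ask, demo_template
)
print(responses)
print(errors)
```

In the lesson's setting the call would be extract_all(section_keys, sections_text, ask_llm, template), and the keys in errors can be retried afterwards.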

Save to JSON:

PYTHON
import json

with open('linkedin_profile_data.json', 'w') as f:
    json.dump(responses, f, indent=4)
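
By default json.dump escapes non-ASCII characters, so the en dashes in date ranges are written as \u2013. That is valid JSON but harder to read in an editor. A sketch of a more readable variant with an explicit encoding, using a temporary directory and a one-entry sample dict in place of the real responses:

```python
import json
import tempfile
from pathlib import Path

sample = {"Experience": "- Senior Manager, Linedata (Sep 2024 – Present)"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "linkedin_profile_data.json"
    # ensure_ascii=False keeps characters such as the en dash readable in the file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=4, ensure_ascii=False)
    raw_file = path.read_text(encoding="utf-8")
    # Reading the file back confirms the data survives the round trip.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == sample)
```

Opening the file with encoding="utf-8" on both write and read avoids platform-dependent default encodings.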

Second LLM Pass — Structured Profile Parsing

The first pass produced a dict of section-level bullet points. The second pass re-parses this dict into a clean, structured schema:

PYTHON
template = """You are provided with LinkedIn profile data in JSON format.
            Parse the data according to the specified schema, correct any spelling errors,
            and condense the information if possible.

### LinkedIn Profile JSON Data:
{context}

### Schema You need to follow:
You need to extract
Name:
Headline:
About:
Experience:
Education:
Skills:
Projects:
Summary:

Do not return preambles or any other information.
### Parsed Data:"""

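# Serialize the dict as real JSON so the prompt matches its "JSON format" wording (json is imported above)
prompt = template.format(context=json.dumps(responses, indent=2)).replace("{", "{{").replace("}", "}}")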
response = ask_llm(prompt=prompt)
print(response)
JSON
{
  "Name": "Laxmi Kant Tiwari",
  "Headline": "Data Scientist | Gen AI in Finance & Investment Services | AI-Driven Financial Modeling | NLP, Python, BERT, AWS, Elasticsearch, GNN, SQL, LLM",
  "About": "Demonstrated 8+ years of expertise in Advanced Analytics and Machine Learning, leading strategic AI initiatives at Linedata. Specializes in designing scalable GenAI/LLM applications for financial workflows to transform legacy processes into intelligent systems.",
  "Experience": [
    {
      "Title": "Senior Manager",
      "Company": "Linedata",
      "Dates": "Sep 2024 – Present",
      "Responsibilities": [
        "Developed AI-powered financial solutions using LLMs, custom algorithms, and scalable architectures",
        "Created intelligent reconciliation systems, security frameworks, and covenant analysis tools"
      ]
    },
    {
      "Title": "Assistant Vice President",
      "Company": "IGP",
      "Dates": "Oct 2023 – Sep 2024",
      "Responsibilities": [
        "Led projects including customer behavior modeling, hashtag search optimization, automated feedback reports, and meta tag generators"
      ]
    },
    {
      "Title": "Associate Vice President",
      "Company": "IGP",
      "Dates": "Apr 2021 – Oct 2023",
      "Responsibilities": [
        "Engineered personalized search algorithms, graph/big data analysis tools, and social graph-based recommendation systems"
      ]
    },
    {
      "Title": "Data Science Manager",
      "Company": "IGP.COM",
      "Dates": "Dec 2019 – Apr 2021",
      "Responsibilities": [
        "Implemented AI-powered search with BERT algorithms, real-time product recommendations, boosting search conversion by 20% and revenue by 25%"
      ]
    },
    {
      "Title": "Co-Founder",
      "Company": "mBreath",
      "Dates": "Aug 2016 – Nov 2019",
      "Responsibilities": [
        "Developed wearable health tech with patents for sleep apnea detection, respiration monitoring, and environmental sound analysis using ML models (CNN, LSTM)"
      ]
    }
  ],
  "Education": [
    {
      "Degree": "Senior Research Scholar, Computer Science",
      "Institution": "IIT Kharagpur",
      "Dates": "2014–2016"
    },
    {
      "Degree": "M.Tech., Computer Science",
      "Institution": "IIT Kharagpur",
      "Dates": "2012–2014"
    }
  ],
  "Skills": ["Finance", "HuggingFace", "NLP", "Python", "BERT", "AWS", "Elasticsearch", "GNN", "SQL", "LLM"],
  "Projects": [...]
}
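
print(response) shows a string, not a Python object. To save the structured profile or use it downstream, the reply has to be parsed, and a model may add text around the JSON despite the instruction not to. A minimal sketch; extract_json is a hypothetical helper and the reply string is a stand-in for a real model response:

```python
import json

def extract_json(text):
    """Parse the text between the first '{' and the last '}' as JSON.

    Raises ValueError if no such span exists or it is not valid JSON.
    """
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in the response")
    return json.loads(text[start:end + 1])

reply = (
    "Here is the parsed profile:\n"
    '{"Name": "Laxmi Kant Tiwari", "Skills": ["NLP", "Python"]}\n'
    "Let me know if you need anything else."
)
profile = extract_json(reply)
print(profile["Name"])
```

The resulting dict can then be written with json.dump in the same way as the first-pass results, ideally under a different file name so the section-level data is not overwritten.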

Two-Pass Pipeline Summary

PLAINTEXT
Raw LinkedIn HTML
     │
     ▼
BeautifulSoup → 20 section elements
     │
     ▼
clean_text() → remove excess whitespace
remove_duplicates() → remove LinkedIn UI duplication
     │
     ▼
section_keys[] ← first line of each section as key
     │
     ▼
Pass 1: LLM extracts up to 5 bullets per section → responses{}
     │
     ▼
Save: linkedin_profile_data.json (section-level bullets)
     │
     ▼
Pass 2: LLM re-parses responses{} → structured JSON schema
        (Name, Headline, About, Experience, Education, Skills, Projects, Summary)

What You Built

In this lesson you built a complete LinkedIn profile extraction pipeline:

  • Selenium automation: browser login with .env credentials, profile page navigation, JavaScript-rendered page capture
  • BeautifulSoup parsing: artdeco-card section extraction from the rendered DOM
  • Text cleaning: regex-based whitespace normalization and LinkedIn UI duplication removal
  • Dynamic section key extraction: section names derived automatically from the first line of each section
  • First LLM pass: extracts up to 5 bullet points per section from the cleaned section text
  • Second LLM pass: converts section-level bullets into a clean JSON profile schema (Name, Headline, About, Experience, Education, Skills, Projects, Summary)
  • JSON persistence: section-level results saved to linkedin_profile_data.json
