LinkedIn Profile Scraper in Python
LinkedIn Profile Scraping using Selenium and Beautiful Soup
Scraping LinkedIn profiles is useful for public relations and marketing tasks. In this project, we are going to scrape the key details from a LinkedIn profile: basic information, experience, and education.
The first part of this project is to automatically log in to our LinkedIn account. For that, you will have to download a web driver. You will also have to install Selenium and beautifulsoup4. Below are the commands you need to run to install both packages. You can also visit the given links for more information about installation.
For Selenium:
!pip install selenium
https://pypi.org/project/selenium/
https://selenium-python.readthedocs.io/api.html
For beautifulsoup4:
!pip install beautifulsoup4
https://pypi.org/project/beautifulsoup4/
Download the ChromeDriver build that matches your Google Chrome version and save it in your working directory (the code below expects it at driver/chromedriver.exe).
Lastly, add your email id and LinkedIn password to the config.txt file, with the email on the first line and the password on the second, and save that file in the working directory.
For more details related to this you can watch this video.
You can even refer to this blog which gives a detailed explanation of the code.
Here we have imported the necessary libraries.
import requests, time, random
from bs4 import BeautifulSoup
from selenium import webdriver
Here we create the Chrome driver using browser = webdriver.Chrome('driver/chromedriver.exe'), passing the path to the ChromeDriver executable. Then we open the LinkedIn login page using browser.get(). Finally, we open the config.txt file which we created and read the username and password from it.
Now we have to automate the login process. For that, we need the id of the textboxes which accept the username and password on the webpage. We can find these by right-clicking anywhere on the webpage and then clicking on 'inspect'. After doing this you will see that the id of the username textbox is username and the id of the password textbox is password.
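find_element_by_id() returns the first element whose id attribute matches the given value. The send_keys() method types text into an element, such as an input field of a form. It types as a user would and does not clear what is already there, so call clear() first if the field may contain text. The submit() method submits the form that contains the element, once you have filled it in.
Note that find_element_by_id() was removed in Selenium 4.3. In newer versions the equivalent call is find_element(By.ID, 'username'), with By imported from selenium.webdriver.common.by.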
browser = webdriver.Chrome('driver/chromedriver.exe')
browser.get('https://www.linkedin.com/uas/login')

# config.txt holds the email on line 1 and the password on line 2
with open('config.txt') as file:
    lines = file.readlines()
# strip the trailing newline so send_keys() does not press Enter early
username = lines[0].strip()
password = lines[1].strip()

elementID = browser.find_element_by_id('username')
elementID.send_keys(username)
elementID = browser.find_element_by_id('password')
elementID.send_keys(password)
elementID.submit()
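The login step depends on config.txt being read cleanly. As a minimal sketch, the helper below reads the first two non-empty lines of the file and strips surrounding whitespace, so a stray newline is never typed into the login form. The name read_credentials is hypothetical and not part of the original code, and the sketch assumes the email is on the first line and the password on the second.

```python
from pathlib import Path


def read_credentials(path="config.txt"):
    """Return (username, password) from a two-line config file.

    Hypothetical helper: assumes the email is on the first non-empty
    line and the password on the second.
    """
    lines = [
        line.strip()
        for line in Path(path).read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    if len(lines) < 2:
        raise ValueError(f"{path} must contain an email and a password")
    return lines[0], lines[1]
```

The returned pair can be passed straight to send_keys() in place of lines[0] and lines[1].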
link contains the URL of the profile we want to scrape. You can scrape any profile of your choice, or scrape multiple profiles by iterating over a list of links with a for loop.
link = 'https://www.linkedin.com/in/rishabh-singh-61b706114/'
browser.get(link)
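The for loop over multiple links mentioned above could look roughly like the sketch below. The function name visit_profiles and the pause bounds are assumptions made for illustration, not values from the original post or limits published by LinkedIn. The browser argument is any object with a get() method and a page_source attribute, such as the Selenium driver created earlier.

```python
import random
import time


def visit_profiles(browser, links, min_pause=2.0, max_pause=5.0, sleep=time.sleep):
    """Open each profile URL in turn and collect the page sources.

    Hypothetical helper: the pause bounds are illustrative values only.
    """
    pages = {}
    for link in links:
        browser.get(link)
        # Pause for a random interval so requests are not fired back to back
        sleep(random.uniform(min_pause, max_pause))
        pages[link] = browser.page_source
    return pages
```

In a full run, the scrolling and parsing steps described in the rest of this post would go inside the loop, after browser.get(link).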
The whole profile is not loaded at the start; only the part we can see is loaded. So we have to scroll to the end of the page so that the complete profile is loaded. The code given below scrolls the page up to three times, pausing after each scroll, and stops early once the page height stops growing.
SCROLL_PAUSE_TIME = 5

# Get scroll height
last_height = browser.execute_script("return document.body.scrollHeight")

for i in range(3):
    # Scroll down to bottom
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
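A fixed limit of three scrolls may stop before a long profile has finished loading. One possible variation, sketched here under the assumption that the same execute_script() calls are used, wraps the loop in a function with a configurable limit. The name scroll_to_bottom and its parameters are hypothetical.

```python
import time


def scroll_to_bottom(browser, pause=5, max_scrolls=3, sleep=time.sleep):
    """Scroll until the page height stops growing or max_scrolls is reached.

    Returns the number of scrolls performed.
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(max_scrolls):
        # Scroll down to the current bottom of the page
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        # Give lazily loaded sections time to render
        sleep(pause)
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return scrolls
```

Calling scroll_to_bottom(browser, max_scrolls=10) would allow up to ten scrolls while still stopping as soon as the height is stable.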
Now that the full page is loaded, we can get the page source. We will parse it with the lxml parser (the lxml package has to be installed separately, for example with pip install lxml) and store the result in a BeautifulSoup object named soup.
src = browser.page_source
soup = BeautifulSoup(src, 'lxml')
To extract anything from the webpage we will have to inspect the webpage. We can do this by right-clicking anywhere on the webpage and then clicking on 'inspect'.
The block containing the basic information is a div tag with the class name flex-1 mr5.
name_div = soup.find('div', {'class': 'flex-1 mr5'})
name_div
<div class="flex-1 mr5"> <ul class="pv-top-card--list inline-flex align-items-center"> <li class="inline t-24 t-black t-normal break-words"> Rishabh Singh </li> <li class="pv-top-card__distance-badge inline-block v-align-text-bottom t-16 t-black--light t-normal"><span class="distance-badge separator"> <span class="visually-hidden">3rd degree connection</span><span aria-hidden="true" class="dist-value">3rd</span> </span></li> <!-- --> <li class="inline-flex ml2"> <span class="pv-member-badge--for-top-card inline-flex pv-member-badge ember-view" id="ember102" style="display: none;"><!-- --> <!-- --> <span class="visually-hidden"> Rishabh has a account </span> <!-- --></span> </li> <!-- --> </ul> <h2 class="mt1 t-18 t-black t-normal break-words"> #futureshaper </h2> <ul class="pv-top-card--list pv-top-card--list-bullet mt1"> <li class="t-16 t-black t-normal inline-block"> Bengaluru, Karnataka, India </li> <!-- --> <li class="inline-block"> <span class="t-16 t-black t-normal"> 500+ connections </span> </li> <li class="inline-block"> <a class="ember-view" data-control-name="contact_see_more" href="/in/rishabh-singh-61b706114/detail/contact-info/" id="ember103"> <span class="t-16 link-without-visited-state"> Contact info </span> </a> </li> </ul> </div>
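Passing the full class string to find() matches the class attribute as an exact string, so the lookup returns None if the classes ever appear in a different order. A CSS selector is one order-independent alternative. The snippet below is a small sketch on made-up HTML, using the built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup: the same two classes as above, but in reverse order
html = '<div class="mr5 flex-1"><h2> #futureshaper </h2></div>'
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: fails when the order differs
exact = soup.find("div", {"class": "flex-1 mr5"})

# CSS selector: matches any div carrying both classes, in any order
by_selector = soup.select_one("div.flex-1.mr5")

print(exact)
print(by_selector.h2.get_text().strip())
```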
We will first get the name. As you can see, name_div contains 2 ul tags. The first ul contains the name, and the second ul contains the location and the number of connections.
Here we will first get both ul tags using name_div.find_all('ul'). We will find the first li in the first ul tag using name_loc[0].find('li') and get the text enclosed in it using get_text().
name_loc = name_div.find_all('ul')
name = name_loc[0].find('li').get_text().strip()
name
'Rishabh Singh'
Similarly, for the location we will find the first li in the second ul tag.
loc = name_loc[1].find('li').get_text().strip()
loc
'Bengaluru, Karnataka, India'
The profile title is enclosed in the h2 tag. So we can extract it using name_div.find('h2').get_text().
profile_title = name_div.find('h2').get_text().strip()
profile_title
'#futureshaper'
The number of connections is in the 2nd li of the 2nd ul. Hence we will first find all the li tags in the second ul using name_loc[1].find_all('li'). Then we will get the text from the second li tag using connection[1].get_text().
connection = name_loc[1].find_all('li')
connection = connection[1].get_text().strip()
connection
'500+ connections'
We will append everything we have scraped so far to a list named info.
info = []
info.append(link)
info.append(name)
info.append(profile_title)
info.append(loc)
info.append(connection)
info
['https://www.linkedin.com/in/rishabh-singh-61b706114/', 'Rishabh Singh', '#futureshaper', 'Bengaluru, Karnataka, India', '500+ connections']
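The chained calls above raise an AttributeError when a tag is missing, because find() returns None in that case. A small guard avoids this, and storing the values in a dict keeps each field labelled. This is only a sketch: text_of and basic_info are hypothetical names, and the HTML is a cut-down version of the block shown earlier with the h2 deliberately left out.

```python
from bs4 import BeautifulSoup


def text_of(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text().strip() if tag is not None else None


# Cut-down sample markup; the h2 with the profile title is omitted on purpose
html = """
<div class="flex-1 mr5">
  <ul><li> Rishabh Singh </li></ul>
  <ul><li> Bengaluru, Karnataka, India </li><li> 500+ connections </li></ul>
</div>
"""
name_div = BeautifulSoup(html, "html.parser").find("div", {"class": "flex-1 mr5"})
name_loc = name_div.find_all("ul")

basic_info = {
    "name": text_of(name_loc[0].find("li")),
    "profile_title": text_of(name_div.find("h2")),  # no h2 here, so None
    "location": text_of(name_loc[1].find("li")),
}
print(basic_info)
```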
Experience
Now we will scrape the information under the experience section of the profile. We can access the experience section using the section tag with the id experience-section.
exp_section = soup.find('section', {'id': 'experience-section'})
exp_section
<section class="pv-profile-section experience-section ember-view" id="experience-section"><header class="pv-profile-section__card-header"> <h2 class="pv-profile-section__card-heading"> Experience </h2> <!-- --></header> <ul class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-no-more"> <li class="pv-entity__position-group-pager pv-profile-section__list-item ember-view" id="ember166"> <section class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view" id="1517647779"> <div class="display-flex justify-space-between full-width"> <div class="display-flex flex-column full-width"> <a class="full-width ember-view" data-control-name="background_details_company" href="/company/honeywell/" id="ember168"> <div class="pv-entity__logo company-logo"> <img alt="Honeywell" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image ember-view" id="ember170" loading="lazy" src="https://media-exp1.licdn.com/dms/image/C560BAQFvcIh3UnA5zw/company-logo_100_100/0?e=1607558400&v=beta&t=dDiL6fU4CW1y7u-RdPOENVsnHqExUQVv9qs_lj14xBw"/> </div> <div class="pv-entity__summary-info pv-entity__summary-info--background-section"> <h3 class="t-16 t-black t-bold">FPGA Engineer</h3> <p class="visually-hidden">Company Name</p> <p class="pv-entity__secondary-title t-14 t-black t-normal"> Honeywell <!-- --> </p> <div class="display-flex"> <h4 class="pv-entity__date-range t-14 t-black--light t-normal"> <span class="visually-hidden">Dates Employed</span> <span>Aug 2019 – Present</span> </h4> <h4 class="t-14 t-black--light t-normal"> <span class="visually-hidden">Employment Duration</span> <span class="pv-entity__bullet-item-v2">1 yr 2 mos</span> </h4> </div> <h4 class="pv-entity__location t-14 t-black--light t-normal block"> <span class="visually-hidden">Location</span> <span>Bengaluru Area, India</span> </h4> <!-- --> </div> </a> <!-- --> </div> <!-- --> </div> </section> </li><li class="pv-entity__position-group-pager 
pv-profile-section__list-item ember-view" id="ember173"> <section class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view" id="929137672"> <div class="display-flex justify-space-between full-width"> <div class="display-flex flex-column full-width"> <a class="full-width ember-view" data-control-name="background_details_company" href="/company/l&t-technology-services-limited/" id="ember175"> <div class="pv-entity__logo company-logo"> <img alt="L&T Technology Services Limited" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image ember-view" id="ember177" loading="lazy" src="https://media-exp1.licdn.com/dms/image/C510BAQFFdHnl8nr2KA/company-logo_100_100/0?e=1607558400&v=beta&t=5GTSsopRboozxuQ6y1Y_LDixQebeP2KBYi39Z3jpOdA"/> </div> <div class="pv-entity__summary-info pv-entity__summary-info--background-section"> <h3 class="t-16 t-black t-bold">FPGA Design Engineer</h3> <p class="visually-hidden">Company Name</p> <p class="pv-entity__secondary-title t-14 t-black t-normal"> L&T Technology Services Limited <span class="pv-entity__secondary-title separator">Full-time</span> </p> <div class="display-flex"> <h4 class="pv-entity__date-range t-14 t-black--light t-normal"> <span class="visually-hidden">Dates Employed</span> <span>Jan 2017 – Jul 2019</span> </h4> <h4 class="t-14 t-black--light t-normal"> <span class="visually-hidden">Employment Duration</span> <span class="pv-entity__bullet-item-v2">2 yrs 7 mos</span> </h4> </div> <h4 class="pv-entity__location t-14 t-black--light t-normal block"> <span class="visually-hidden">Location</span> <span>Bengaluru Area, India</span> </h4> <!-- --> </div> </a> <!-- --> </div> <!-- --> </div> </section> </li> </ul> <!-- --></section>
From exp_section we are going to get the first ul tag. Then from the first ul tag we are going to get the first div tag. Then from the first div tag we are going to get the first a tag.
exp_section = exp_section.find('ul')
div_tag = exp_section.find('div')
a_tag = div_tag.find('a')
a_tag
<a class="full-width ember-view" data-control-name="background_details_company" href="/company/honeywell/" id="ember168"> <div class="pv-entity__logo company-logo"> <img alt="Honeywell" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image ember-view" id="ember170" loading="lazy" src="https://media-exp1.licdn.com/dms/image/C560BAQFvcIh3UnA5zw/company-logo_100_100/0?e=1607558400&v=beta&t=dDiL6fU4CW1y7u-RdPOENVsnHqExUQVv9qs_lj14xBw"/> </div> <div class="pv-entity__summary-info pv-entity__summary-info--background-section"> <h3 class="t-16 t-black t-bold">FPGA Engineer</h3> <p class="visually-hidden">Company Name</p> <p class="pv-entity__secondary-title t-14 t-black t-normal"> Honeywell <!-- --> </p> <div class="display-flex"> <h4 class="pv-entity__date-range t-14 t-black--light t-normal"> <span class="visually-hidden">Dates Employed</span> <span>Aug 2019 – Present</span> </h4> <h4 class="t-14 t-black--light t-normal"> <span class="visually-hidden">Employment Duration</span> <span class="pv-entity__bullet-item-v2">1 yr 2 mos</span> </h4> </div> <h4 class="pv-entity__location t-14 t-black--light t-normal block"> <span class="visually-hidden">Location</span> <span>Bengaluru Area, India</span> </h4> <!-- --> </div> </a>
We can extract the job title using the h3 tag.
job_title = a_tag.find('h3').get_text().strip()
job_title
'FPGA Engineer'
The company name is enclosed in the 2nd p tag. Hence we can get it using a_tag.find_all('p')[1].get_text().
company_name = a_tag.find_all('p')[1].get_text().strip()
company_name
'Honeywell'
For the joining date we will extract the first h4 tag using a_tag.find_all('h4')[0]. Then we will get the second span from the first h4 using find_all('span')[1].
joining_date = a_tag.find_all('h4')[0].find_all('span')[1].get_text().strip()
joining_date
'Aug 2019 – Present'
For the duration we will extract the second h4 tag using a_tag.find_all('h4')[1]. Then we will get the second span using find_all('span')[1].
exp = a_tag.find_all('h4')[1].find_all('span')[1].get_text().strip()
exp
'1 yr 2 mos'
We will append all the scraped data to info.
info
['https://www.linkedin.com/in/rishabh-singh-61b706114/', 'Rishabh Singh', '#futureshaper', 'Bengaluru, Karnataka, India', '500+ connections']
info.append(company_name)
info.append(job_title)
info.append(joining_date)
info.append(exp)
info
['https://www.linkedin.com/in/rishabh-singh-61b706114/', 'Rishabh Singh', '#futureshaper', 'Bengaluru, Karnataka, India', '500+ connections', 'Honeywell', 'FPGA Engineer', 'Aug 2019 – Present', '1 yr 2 mos']
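The steps above read only the first position. To collect every entry in the experience section, the same lookups can be applied to each li in turn. The sketch below assumes the structure shown in the output earlier and uses a trimmed copy of that HTML; parse_experience is a hypothetical name. It takes the first text chunk of the company p tag so that a trailing label such as Full-time is not included.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the experience markup shown earlier
html = """
<section id="experience-section"><ul>
  <li><a>
    <h3>FPGA Engineer</h3>
    <p class="visually-hidden">Company Name</p>
    <p class="pv-entity__secondary-title"> Honeywell </p>
    <h4><span>Dates Employed</span> <span>Aug 2019 – Present</span></h4>
    <h4><span>Employment Duration</span> <span>1 yr 2 mos</span></h4>
  </a></li>
  <li><a>
    <h3>FPGA Design Engineer</h3>
    <p class="visually-hidden">Company Name</p>
    <p class="pv-entity__secondary-title"> L&amp;T Technology Services Limited <span>Full-time</span> </p>
    <h4><span>Dates Employed</span> <span>Jan 2017 – Jul 2019</span></h4>
    <h4><span>Employment Duration</span> <span>2 yrs 7 mos</span></h4>
  </a></li>
</ul></section>
"""


def parse_experience(soup):
    """Return one dict per position listed in the experience section."""
    section = soup.find("section", {"id": "experience-section"})
    if section is None:
        return []
    jobs = []
    # Only the direct li children of the list are position entries
    for item in section.find("ul").find_all("li", recursive=False):
        h4_tags = item.find_all("h4")
        company = item.find("p", class_="pv-entity__secondary-title")
        jobs.append({
            "job_title": item.find("h3").get_text().strip(),
            # First text chunk only, which skips the Full-time label
            "company_name": next(company.stripped_strings),
            "joining_date": h4_tags[0].find_all("span")[1].get_text().strip(),
            "duration": h4_tags[1].find_all("span")[1].get_text().strip(),
        })
    return jobs


jobs = parse_experience(BeautifulSoup(html, "html.parser"))
for job in jobs:
    print(job)
```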
Education
Now we will move to the education section. We can extract it using the section tag with the id education-section. Then we will get the ul tag which contains all the information.
edu_section = soup.find('section', {'id': 'education-section'}).find('ul')
edu_section
<ul class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-no-more"> <li class="pv-profile-section__list-item pv-education-entity pv-profile-section__card-item ember-view" id="356637700"><div class="display-flex justify-space-between full-width"> <div class="display-flex flex-column full-width"> <a class="ember-view" data-control-name="background_details_school" href="/school/223389/?legacySchoolId=223389" id="ember183"> <div class="pv-entity__logo"> <img alt="Technocrats Institute of Technology (Excellence), Anand Nagar, PB No. 24, Post Piplani, BHEL, Bhopal - 462021" class="pv-entity__logo-img pv-entity__logo-img EntityPhoto-square-4 lazy-image ghost-school ember-view" id="ember185" loading="lazy" src=""/> </div> <div class="pv-entity__summary-info pv-entity__summary-info--background-section"> <div class="pv-entity__degree-info"> <h3 class="pv-entity__school-name t-16 t-black t-bold">Technocrats Institute of Technology (Excellence), Anand Nagar, PB No. 
24, Post Piplani, BHEL, Bhopal - 462021</h3> <p class="pv-entity__secondary-title pv-entity__degree-name t-14 t-black t-normal"> <span class="visually-hidden">Degree Name</span> <span class="pv-entity__comma-item">Bachelor of Engineering (B.E.)</span> </p> <p class="pv-entity__secondary-title pv-entity__fos t-14 t-black t-normal"> <span class="visually-hidden">Field Of Study</span> <span class="pv-entity__comma-item">Electrical, Electronics and Communications Engineering</span> </p> <p class="pv-entity__secondary-title pv-entity__grade t-14 t-black t-normal"> <span class="visually-hidden">Grade</span> <span class="pv-entity__comma-item">FIRST</span> </p> </div> <p class="pv-entity__dates t-14 t-black--light t-normal"> <span class="visually-hidden">Dates attended or expected graduation</span> <span> <time>2012</time> – <time>2016</time> </span> </p> <!-- --></div> </a> <!-- --> </div> <!-- --></div> </li> <li class="pv-profile-section__list-item pv-education-entity pv-profile-section__card-item ember-view" id="373985416"><div class="display-flex justify-space-between full-width"> <div class="display-flex flex-column full-width"> <a class="ember-view" data-control-name="background_details_school" href="/search/results/all/?keywords=S.H.S.B.B" id="ember188"> <div class="pv-entity__logo"> <img alt="S.H.S.B.B" class="pv-entity__logo-img pv-entity__logo-img EntityPhoto-square-4 lazy-image ghost-school ember-view" id="ember190" loading="lazy" src=""/> </div> <div class="pv-entity__summary-info pv-entity__summary-info--background-section"> <div class="pv-entity__degree-info"> <h3 class="pv-entity__school-name t-16 t-black t-bold">S.H.S.B.B</h3> <!-- --> <p class="pv-entity__secondary-title pv-entity__fos t-14 t-black t-normal"> <span class="visually-hidden">Field Of Study</span> <span class="pv-entity__comma-item">PCM</span> </p> <!-- --> </div> <!-- --> <!-- --></div> </a> <!-- --> </div> <!-- --></div> </li> </ul>
We can get the name of the college directly using the h3 tag.
college_name = edu_section.find('h3').get_text().strip()
college_name
'Technocrats Institute of Technology (Excellence), Anand Nagar, PB No. 24, Post Piplani, BHEL, Bhopal - 462021'
We will get the name of the degree from the second span of the p tag with class pv-entity__secondary-title pv-entity__degree-name t-14 t-black t-normal.
degree_name = edu_section.find('p', {'class': 'pv-entity__secondary-title pv-entity__degree-name t-14 t-black t-normal'}).find_all('span')[1].get_text().strip()
degree_name
'Bachelor of Engineering (B.E.)'
We will get the stream from the second span of the p tag with class pv-entity__secondary-title pv-entity__fos t-14 t-black t-normal.
stream = edu_section.find('p', {'class': 'pv-entity__secondary-title pv-entity__fos t-14 t-black t-normal'}).find_all('span')[1].get_text().strip()
stream
'Electrical, Electronics and Communications Engineering'
We will get the years of the degree from the second span of the p tag with class pv-entity__dates t-14 t-black--light t-normal.
degree_year = edu_section.find('p', {'class': 'pv-entity__dates t-14 t-black--light t-normal'}).find_all('span')[1].get_text().strip()
degree_year
'2012 – 2016'
We will append everything we have scraped to info.
info
['https://www.linkedin.com/in/rishabh-singh-61b706114/', 'Rishabh Singh', '#futureshaper', 'Bengaluru, Karnataka, India', '500+ connections', 'Honeywell', 'FPGA Engineer', 'Aug 2019 – Present', '1 yr 2 mos']
info.append(college_name)
info.append(degree_name)
info.append(stream)
info.append(degree_year)
info
['https://www.linkedin.com/in/rishabh-singh-61b706114/', 'Rishabh Singh', '#futureshaper', 'Bengaluru, Karnataka, India', '500+ connections', 'Honeywell', 'FPGA Engineer', 'Aug 2019 – Present', '1 yr 2 mos', 'Technocrats Institute of Technology (Excellence), Anand Nagar, PB No. 24, Post Piplani, BHEL, Bhopal - 462021', 'Bachelor of Engineering (B.E.)', 'Electrical, Electronics and Communications Engineering', '2012 – 2016']
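Not every education entry has every field. In the output above, the second entry has a field of study but no degree name or dates, so the chained find() calls would fail on it. A rough sketch of a loop that tolerates missing fields follows. The helper second_span_text is a hypothetical name, and the HTML is a shortened copy of the education markup with the school name abbreviated.

```python
from bs4 import BeautifulSoup

# Shortened copy of the education markup shown earlier
html = """
<ul>
  <li>
    <h3 class="pv-entity__school-name">Technocrats Institute of Technology</h3>
    <p class="pv-entity__degree-name"><span>Degree Name</span> <span>Bachelor of Engineering (B.E.)</span></p>
    <p class="pv-entity__fos"><span>Field Of Study</span> <span>Electrical, Electronics and Communications Engineering</span></p>
    <p class="pv-entity__dates"><span>Dates attended or expected graduation</span> <span> <time>2012</time> – <time>2016</time> </span></p>
  </li>
  <li>
    <h3 class="pv-entity__school-name">S.H.S.B.B</h3>
    <p class="pv-entity__fos"><span>Field Of Study</span> <span>PCM</span></p>
  </li>
</ul>
"""


def second_span_text(item, css_class):
    """Text of the second span inside the p tag with css_class, or None."""
    p_tag = item.find("p", class_=css_class)
    if p_tag is None:
        return None
    # Collapse runs of whitespace left over from the page layout
    return " ".join(p_tag.find_all("span")[1].get_text().split())


edu_section = BeautifulSoup(html, "html.parser").find("ul")
education = [
    {
        "college_name": item.find("h3").get_text().strip(),
        "degree_name": second_span_text(item, "pv-entity__degree-name"),
        "stream": second_span_text(item, "pv-entity__fos"),
        "degree_year": second_span_text(item, "pv-entity__dates"),
    }
    for item in edu_section.find_all("li")
]
for entry in education:
    print(entry)
```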
We have scraped all the important data from the LinkedIn profile. The same code can be used to scrape many more profiles.
Note: LinkedIn can change the ids and class names of these tags at any time. Before running this code, inspect the webpage and check that the ids and class names used here are still current.