In its simplest form, web scraping involves automating the process of collecting information available on the web, which can then be stored, analyzed, or used to fuel decision-making processes.
Now, you might be wondering, why LinkedIn? LinkedIn, as a professional networking platform, is a treasure trove of data. It hosts a wealth of information about professionals, companies, and job opportunities. For instance, recruiters might use it to find potential candidates, sales teams can identify potential leads, and researchers can use it for labor market analysis. The possibilities are endless.
In this tutorial, you’ll learn how to scrape data from LinkedIn using Beautiful Soup. After you learn about the process step-by-step, you’ll also learn about the Bright Data solution that makes scraping LinkedIn much faster.
Scraping LinkedIn in Python
In this tutorial, you’ll use Python to create a web scraper using free tools like Beautiful Soup and Requests. So let’s get started!
Please note: This tutorial is intended for educational purposes and to demonstrate technical capabilities only. Be aware that scraping data from LinkedIn is strictly prohibited, according to LinkedIn’s User Agreement. Any misuse of this information to scrape LinkedIn can lead to your LinkedIn account being permanently banned or other potential legal repercussions. Proceed at your own risk and discretion.
Before you begin, make sure you have Python version 3.7.9 or above installed on your system.
After installing Python, the next step is to set up the required libraries for web scraping. Here, you'll utilize `requests` to make HTTP requests, `BeautifulSoup` (BS4) to parse HTML content, and `Playwright` for browser interaction and task automation. Open your shell or terminal and run the following commands:
```
pip3 install beautifulsoup4
pip3 install requests
pip3 install playwright
```
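Playwright also needs browser binaries to drive. The `playwright` CLI ships with the pip package and can download them (if you only connect to a remote browser over CDP, as in the Bright Data section later, this step may not be required):

```
playwright install chromium
```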
LinkedIn Structure and Data Objects
Before you start scraping LinkedIn, the following section will talk about the site's structure and identify the data objects you'll extract. For the purpose of this tutorial, you'll focus on scraping job listings, courses, company information, and articles:
- Job listings contain details such as job title, company, location, and job description.
- Course information can include the course title, instructor, duration, and description.
- Company data can include the company name, industry, size, location, and description.
- Articles are authored by professionals and cover topics such as professional development and industry insights.
For example, if you want to have a better understanding of the HTML structure of LinkedIn’s Jobs page, follow these steps:
- Go to the LinkedIn website and sign into your account.
- Click on the Jobs icon on the top navigation bar. Enter any job title (eg “frontend developer”) and press Enter.
- Right-click on a job item from the list and click Inspect to open the browser’s developer tools.
- Analyze the HTML structure to identify the tags and attributes containing the data you want to scrape.
Scrape Job Listings
Start by scraping job listings from LinkedIn. You'll use `requests` to fetch the page's HTML content and `BeautifulSoup` to parse and extract the relevant information.

Create a new file named `scraper_linkedIn_jobs.py` and add the following code:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.linkedin.com/jobs/search?keywords=Frontend%20Developer&location=United%20States&pageNum=0'

response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Each job result is rendered as a card with this class
    job_listings = soup.find_all('div', {'class': 'job-search-card'})
    for job in job_listings:
        title = job.find('h3', {'class': 'base-search-card__title'}).text.strip()
        company = job.find('a', {'class': 'hidden-nested-link'}).text.strip()
        location = job.find('span', {'class': 'job-search-card__location'}).text.strip()
        # The whole card is wrapped in an anchor that links to the job posting
        anchor_tag = job.find('a', class_='base-card__full-link')
        href_link = anchor_tag['href']
        print(f"Title: {title}\nCompany: {company}\nLocation: {location}\nJob Link: {href_link}\n")
else:
    print("Failed to fetch job listings.")
```
This code fetches job listings from a LinkedIn search page for frontend developer positions in the United States.
Note: In the defined `url`, you can customize the job search to your preferences using URL parameters. For example, you can change `location=United%20States` to the country of your choosing to find job listings in that specific location. Similarly, you can modify `keywords=Frontend%20Developer` to any other job title you're interested in, allowing you to search for jobs based on different keywords. Additionally, you can adjust `pageNum=0` to navigate through various pages of search results to explore more job opportunities. These parameters give you the flexibility to tailor the job search to your desired criteria and preferences.
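If you'd rather not hand-encode these parameters, here's a minimal sketch of building the same search URL with Python's standard library (the keyword and location values are just examples):

```python
from urllib.parse import urlencode

# Base endpoint used in the script above
base_url = 'https://www.linkedin.com/jobs/search'

# Adjust these values to customize the search
params = {
    'keywords': 'Backend Developer',  # job title to search for
    'location': 'Germany',            # country or city to search in
    'pageNum': 0,                     # results page to start from
}

url = f'{base_url}?{urlencode(params)}'
print(url)
# https://www.linkedin.com/jobs/search?keywords=Backend+Developer&location=Germany&pageNum=0
```

Note that `urlencode` encodes spaces as `+`, which is equivalent to `%20` in a query string.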
Run the code from your shell or terminal using the following command:
```
python3 scraper_linkedIn_jobs.py
```
You should get a list of jobs with their title, company, location, and a link to the job. Your results should look like this:
```
…output omitted…
Title: Frontend Engineer
Company: Klarity
Location: San Francisco, CA
Job Link: https://www.linkedin.com/jobs/view/desenvolvedor-front-end-at-pasquali-solution-3671519424?refId=JN%2FeM862Wu7qnbJd96Eoww%3D%3D&trackingId=kTSLczKp1q4aurZ5rSzRPQ%3D%3D&position=1&pageNum=0&trk=public_jobs_jserp-result_search-card

Title: Front-End Developer (Remote)
Company: Prevail Legal
Location: United States
Job Link: https://www.linkedin.com/jobs/view/desenvolvedor-front-end-at-pasquali-solution-3671519424?refId=JN%2FeM862Wu7qnbJd96Eoww%3D%3D&trackingId=kTSLczKp1q4aurZ5rSzRPQ%3D%3D&position=1&pageNum=0&trk=public_jobs_jserp-result_search-card
…output omitted…
```
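If you want to store the scraped jobs instead of just printing them, here's a minimal sketch using Python's built-in csv module; it assumes you collect each job as a dictionary in a `jobs` list inside the scraping loop above:

```python
import csv

# Inside the scraping loop, collect each job instead of printing it:
# jobs.append({'title': title, 'company': company,
#              'location': location, 'link': href_link})
jobs = []

with open('jobs.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'company', 'location', 'link'])
    writer.writeheader()
    writer.writerows(jobs)
```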
Scrape LinkedIn Learning
In addition to scraping job listings, you can also scrape courses from the LinkedIn Learning page.
Create a new file named `scraper_linkedIn_courses.py` and add the following code:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.linkedin.com/learning/search?trk=content-hub-home-page_guest_nav_menu_learning'

response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    course_listings = soup.find_all('li', {'class': 'results-list__item'})
    for course in course_listings:
        title = course.find('h3', {'class': 'base-search-card__title'}).text.strip()
        created_by = course.find('h4', {'class': 'base-search-card__subtitle'}).text.strip()
        duration = course.find('div', {'class': 'search-entity-media__duration'}).text.strip()
        # Find the anchor tag containing the link
        anchor_tag = course.find('a', class_='base-card__full-link')
        # Extract the 'href' attribute value
        if anchor_tag:
            href_link = anchor_tag['href']
        else:
            href_link = None
            print("Anchor tag not found.")
        print(f"Title: {title}\nCreated By: {created_by}\nDuration: {duration}\nCourse Link: {href_link}\n")
else:
    print("Failed to fetch course listings.")
```
Here, you’re using requests
to access the LinkedIn Learning page and BeautifulSoup
to parse it. You’re searching for li
elements with the class results-list__item
, which contains the course listings. For each course, you extract and print the title, creator, duration, and link. If the initial request fails, you print a failure message.
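Note that any of these elements can be missing from a listing, in which case `find()` returns `None` and calling `.text` raises an `AttributeError`. If you want the scraper to tolerate that, a small optional helper (a sketch, not part of the original script) can wrap each lookup:

```python
def safe_text(element):
    """Return the stripped text of a BS4 element, or None if it wasn't found."""
    return element.text.strip() if element else None

# Usage inside the loop:
# title = safe_text(course.find('h3', {'class': 'base-search-card__title'}))
```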
Run the code from your shell or terminal using the following command:
```
python3 scraper_linkedIn_courses.py
```
You should get a list of courses with their title, creator, duration, and a link to the course. Your results will look like this:
```
…output omitted…
Title: Define general intelligence
Created By: From: Introduction to Artificial Intelligence
Duration: 3m
Course Link: https://www.linkedin.com/learning/introduction-to-artificial-intelligence/define-general-intelligence?trk=learning-serp_learning-search-card_search-card

Title: Shortcut menus and the Mini toolbar
Created By: From: Excel Essential Training (Microsoft 365)
Duration: 4m
Course Link: https://www.linkedin.com/learning/excel-essential-training-microsoft-365-17231101/shortcut-menus-and-the-mini-toolbar?trk=learning-serp_learning-search-card_search-card

Title: Learning Excel: Data Analysis
Created By: By: Curt Frye
Duration: 3h 16m
Course Link: https://www.linkedin.com/learning/learning-excel-data-analysis-18868618?trk=learning-serp_learning-search-card_search-card
…output omitted…
```
Scrape LinkedIn Articles
You can also scrape article data from the LinkedIn Articles page.
To do so, create a new file named `scraper_linkedIn_articles.py` and add the following code:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.linkedin.com/pulse/topics/home/?trk=guest_homepage-basic_guest_nav_menu_articles'

response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    article_listings = soup.find_all('div', {'class': 'content-hub-entities'})
    for article in article_listings:
        title = article.find('h2', {'class': 'break-words'}).text.strip()
        description = article.find('p', {'class': 'content-description'}).text.strip()
        # Find the anchor tag containing the link
        anchor_tag = article.find('a', class_='min-w-0')
        # Extract the 'href' attribute value
        if anchor_tag:
            href_link = anchor_tag['href']
        else:
            href_link = None
            print("Anchor tag not found.")
        print(f"Title: {title}\nDescription: {description}\nArticle Link: {href_link}\n")
else:
    print("Failed to fetch article listings.")
```
In this code, you’re using requests
to fetch a LinkedIn page and BeautifulSoup
to parse it. You’re looking for div
elements with the class content-hub-entities
, which hold the article listings. For each article, you extract and print the title, description, and link. If the initial request fails, a failure message will print.
Run the code from your shell or terminal using the following command:
```
python3 scraper_linkedIn_articles.py
```
You’ll get a list of articles with their title, description, and a link to the article. Your results should look like this:
```
…output omitted…
Title: What are some of the emerging leadership trends and theories that you should be aware of?
Description: Learn about the six emerging leadership styles and frameworks that can help you develop your leadership skills and potential in a changing and complex world.
Article Link: https://www.linkedin.com/advice/1/what-some-emerging-leadership-trends-theories

Title: What are the most effective strategies for handling a leadership transition?
Description: Learn six strategies to manage a leadership transition smoothly and successfully, from assessing the situation to planning for the future.
Article Link: https://www.linkedin.com/advice/0/what-most-effective-strategies-handling

Title: How do you combine quality assurance training with other learning initiatives?
Description: Learn some strategies and tips for integrating quality assurance training with other learning objectives and methods in your organization.
Article Link: https://www.linkedin.com/advice/0/how-do-you-combine-quality-assurance-training
…output omitted…
```
All the code for this tutorial is available in this GitHub repository.
What to Consider When Scraping LinkedIn
LinkedIn, like many other websites, employs several techniques to prevent automated scraping of its data. Understanding these techniques can help you navigate around them and ensure your scraping activities are successful:
- Pagination: LinkedIn displays search results in paginated format. Ensure your scraping script handles pagination to retrieve all relevant data (see the sketch after this list).
- Ads: LinkedIn displays ads in various sections. Make sure your scraping script targets the actual data and avoids extracting ad content.
- Rate limiting: LinkedIn monitors the number of requests coming from an IP address within a certain period. If the number of requests exceeds a certain limit, LinkedIn may temporarily or permanently block the IP address.
- CAPTCHA: LinkedIn may present a CAPTCHA challenge if it detects unusual activity from an IP address. CAPTCHAs are designed to be easy for humans to solve but difficult for bots, thus preventing automated scraping.
- Login requirement: Some data on LinkedIn is only accessible when logged in (ie user profiles and company pages). This means that any attempt to scrape this data would require an automated login, which LinkedIn can detect and block.
- Dynamic content: LinkedIn uses JavaScript to load some content dynamically. This can make it harder to scrape because the data may not be present in the HTML when the page initially loads.
- robots.txt: LinkedIn's `robots.txt` file specifies which parts of the site web crawlers are allowed to access. While not strictly a prevention technique, ignoring the directives in this file can lead to your IP being blocked.
- Reduction in data points: LinkedIn has limited the types of publicly available information, such as education or experience. Due to split-testing, you may sometimes see these fields and sometimes not when scraping LinkedIn.
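For example, here's a minimal sketch of handling pagination for the jobs search from earlier by incrementing the `pageNum` parameter, with a pause between requests to stay under rate limits (LinkedIn can change its pagination behavior at any time, so treat this as illustrative):

```python
import time

import requests
from bs4 import BeautifulSoup

base_url = ('https://www.linkedin.com/jobs/search'
            '?keywords=Frontend%20Developer&location=United%20States&pageNum={}')

for page_num in range(3):  # first three pages of results
    response = requests.get(base_url.format(page_num))
    if response.status_code != 200:
        print(f'Failed to fetch page {page_num}.')
        break
    soup = BeautifulSoup(response.text, 'html.parser')
    job_listings = soup.find_all('div', {'class': 'job-search-card'})
    if not job_listings:
        break  # no more results to scrape
    print(f'Page {page_num}: {len(job_listings)} job cards')
    time.sleep(2)  # be polite: pause between requests
```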
Remember, while it’s technically possible to navigate around these techniques, doing so may violate LinkedIn’s terms of service and can lead to your account being banned. Always ensure that your scraping activities are legal and ethical.
A Better Option: Use Bright Data to Scrape LinkedIn
Although manual web scraping works for small-scale data extraction, it becomes time-consuming and inefficient at scale. Bright Data offers a simpler and more efficient alternative, allowing you to effortlessly access vast amounts of LinkedIn data.
Bright Data offers two primary products for web scraping:
- Scraping Browser: The Scraping Browser is a browser-based solution that allows you to interact with websites just like a regular user. It handles JavaScript rendering, AJAX requests, and other complexities, making it ideal for scraping dynamic websites like LinkedIn.
- LinkedIn datasets: The LinkedIn dataset is a precollected and structured data set containing LinkedIn data, including job listings, user profiles, and company information. You can access and download the data directly from the Bright Data platform.
Set Up Your Bright Data Account
To set up the Scraping Browser and access LinkedIn data on the Bright Data platform, follow these steps:
- Create an account on the Bright Data website by clicking Start free trial and following the instructions.
- After logging in, click on the credit card icon on the left navigation panel to go to the Billing page. Then add a payment method to activate your account.
- Next, click on the pin icon to open the Proxies & Scraping Infrastructure page. Select Scraping Browser > Get started.
- Give your solution a name and click on the Add button.
- Select Access parameters and take note of your username, host, and password, as you'll need them in the next step.
After you’ve completed all these steps, you can proceed to the next section.
Scrape LinkedIn Company Data Using the Scraping Browser
To scrape company data from a company's page on LinkedIn, create a new file named `scraper_linkedIn_bdata_company.py` and add the following code:
```python
import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

username = 'YOUR_BRIGHTDATA_USERNAME'
password = 'YOUR_BRIGHTDATA_PASSWORD'
auth = f'{username}:{password}'
host = 'YOUR_BRIGHTDATA_HOST'
browser_url = f'wss://{auth}@{host}'

async def main():
    async with async_playwright() as pw:
        print('connecting')
        browser = await pw.chromium.connect_over_cdp(browser_url)
        print('connected')
        page = await browser.new_page()
        print('goto')
        await page.goto('https://www.linkedin.com/company/spacex/', timeout=120000)
        print('done, evaluating')

        # Get the entire HTML content
        html_content = await page.evaluate('()=>document.documentElement.outerHTML')

        # Parse the HTML with Beautiful Soup
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extract the 'About us' description
        description_element = soup.select_one('.core-section-container[data-test-id="about-us"] p[data-test-id="about-us__description"]')
        description = description_element.text if description_element else None
        print('Description:')
        print(description)

        # Extract the 'Company size'
        company_size_element = soup.select_one('div[data-test-id="about-us__size"] dd')
        company_size = company_size_element.text.strip() if company_size_element else None
        print('Company size:')
        print(company_size)

        await browser.close()

# Run the async function
asyncio.run(main())
```
In this code, you’re using Playwright for browser automation. You connect to a Chromium browser through a proxy, navigate to the company page of SpaceX, and extract the About us description and Company size.
To get the HTML content, you use Playwright's `evaluate` method and then parse it with Beautiful Soup to find the specific elements and print the extracted information. You leverage Playwright's asynchronous features by defining an async function called `main()`, and you start the script's execution with `asyncio.run(main())`.
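Because parts of the page are rendered dynamically, you may also want to wait for the target element before reading the DOM. Here's a minimal sketch using Playwright's `wait_for_selector` with the same about-us selector as above (the helper function and its name are just for illustration):

```python
import asyncio
from playwright.async_api import async_playwright

async def fetch_company_html(browser_url: str) -> str:
    async with async_playwright() as pw:
        browser = await pw.chromium.connect_over_cdp(browser_url)
        page = await browser.new_page()
        await page.goto('https://www.linkedin.com/company/spacex/', timeout=120000)
        # Block until the About section is in the DOM (up to 60 s),
        # so dynamically loaded content is present before extraction
        await page.wait_for_selector(
            '.core-section-container[data-test-id="about-us"]', timeout=60000)
        html = await page.content()  # full HTML, like the evaluate() call above
        await browser.close()
        return html

# html_content = asyncio.run(fetch_company_html(browser_url))
```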
Note: Please ensure that you substitute `YOUR_BRIGHTDATA_USERNAME`, `YOUR_BRIGHTDATA_PASSWORD`, and `YOUR_BRIGHTDATA_HOST` with the correct and specific login credentials of your Bright Data account. This step is crucial to authenticate and access your account successfully.
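To avoid hardcoding credentials in the script, you can read them from environment variables instead; a minimal sketch (the variable names are just examples):

```python
import os

# Export these in your shell first, e.g.:
#   export BRIGHTDATA_USERNAME='...'
#   export BRIGHTDATA_PASSWORD='...'
#   export BRIGHTDATA_HOST='...'
username = os.environ['BRIGHTDATA_USERNAME']
password = os.environ['BRIGHTDATA_PASSWORD']
host = os.environ['BRIGHTDATA_HOST']

browser_url = f'wss://{username}:{password}@{host}'
```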
Open your shell or terminal and run your code with the following command:
```
python3 scraper_linkedIn_bdata_company.py
```
You should have an output that looks like this:
```
…output omitted…
Description:
SpaceX designs, manufactures and launches the world's most advanced rockets and spacecraft. The company was founded in 2002 by Elon Musk to revolutionize space transportation, with the ultimate goal of making life multiplanetary.

SpaceX has gained worldwide attention for a series of historic milestones. It is the only private company ever to return a spacecraft from low-Earth orbit, which it first accomplished in December 2010. The company made history again in May 2012 when its Dragon spacecraft attached to the International Space Station, exchanged cargo payloads, and returned safely to Earth — a technically challenging feat previously accomplished only by governments. Since then Dragon has delivered cargo to and from the space station multiple times, providing regular cargo resupply missions for NASA.

For more information, visit www.spacex.com.
Company size:
1,001-5,000 employees
```
The initial approach you used to scrape LinkedIn can run into obstacles such as pop-ups and reCAPTCHAs that block your code. However, the Bright Data Scraping Browser lets you overcome these obstacles, ensuring uninterrupted scraping.
Bright Data LinkedIn Data Set
An alternative to manually scraping data from LinkedIn is to purchase LinkedIn datasets, which give you access to valuable personal data, including user profiles and additional information. Using the Bright Data LinkedIn dataset eliminates the need for manual web scraping, saves time, and provides structured data ready for analysis.
To find out what datasets are available, go to your Bright Data Dashboard, click on Web Data in the left navigation bar, and select LinkedIn people profiles datasets (Public web data).

Now, you can apply filters to further refine your choices and obtain specific data that meets your criteria, then click Purchase options to see how much it'll cost.

The pricing is based on the number of records you choose, allowing you to tailor your purchase according to your needs and budget. By purchasing these datasets, you can simplify your workflow significantly, avoiding the manual effort of data extraction and collection.
Conclusion
In this article, you learned how to manually scrape data from LinkedIn using Python, and you were introduced to Bright Data, a solution that simplifies and accelerates the data scraping process. Whether you’re scraping data for market research, talent acquisition, or competitive analysis, these tools and techniques can help you gather the information you need.
However, if you’re looking for a more efficient and reliable solution, consider using Bright Data’s LinkedIn Scraper API. This powerful API allows you to scrape LinkedIn data seamlessly, handling dynamic content and anti-bot measures with ease. Additionally, Bright Data offers LinkedIn datasets, providing pre-collected and structured data ready for analysis. With these tools, you can save time, ensure data accuracy, and focus on deriving actionable insights from the data. Start with a free trial today!
Note: This guide was thoroughly tested by our team at the time of writing, but as websites frequently update their code and structure, some steps may no longer work as expected.