In this article, you will discover:
- What an Indeed scraper is and how it works
- The types of data you can extract automatically from Indeed
- How to build an Indeed scraping script using Python
- When and why you might need a more advanced solution
Let’s get started!
What Is an Indeed Scraper?
An Indeed scraper automatically extracts job listings and related data from the Indeed website. It works by mimicking human interactions to navigate job search pages. After that, it identifies specific elements like job titles, companies, locations, and descriptions. Finally, the scraping bot extracts data from them and exports it for analysis.
Data You Can Find on Indeed
Indeed is a treasure trove of job-related data, which can be invaluable for market analysis, recruitment, or research purposes. Below is a list of the key data points you can scrape from it:
- Job titles: The role or position advertised in the listing.
- Company names: Details of the employer, including company profiles.
- Locations: The city, state, or country where the job is based.
- Job descriptions: Detailed information about the role, responsibilities, and requirements.
- Salary ranges: Advertised pay scales (if available).
- Job types: Full-time, part-time, contract, internship, etc.
- Posting dates: When the job listing was published.
- Tags and attributes: Keywords like “Urgently Hiring” or “Remote.”
- Ratings and reviews: Employer ratings and employee feedback.
- Application options: Indicators like “Easy Apply” availability.
If your focus is on job positions, follow our guide on how to scrape job postings.
How to Scrape Indeed: Step-By-Step Guide
In this tutorial section, you will see how to create an Indeed scraper. You will be guided through the process of building a Python script that scrapes the Indeed “data scientist” job posting page. Follow the instructions and learn how to scrape Indeed!
Step #1: Project Setup
Before getting started, make sure you have Python 3 installed on your machine. Otherwise, download it and install it.
Now, launch the command below in the terminal to create a directory for your project:
mkdir indeed_scraper
The indeed_scraper folder will contain your Python Indeed scraper.
Enter it in the terminal, and initialize a virtual environment inside it:
cd indeed_scraper
python -m venv env
Next, load the project folder in your favorite Python IDE. Visual Studio Code with the Python extension and PyCharm Community Edition are both good options.
Create a scraper.py file in the project’s directory. scraper.py will soon contain the desired scraping logic.
Time to activate the virtual environment in the IDE’s terminal. In Linux or macOS, do it with this command:
source ./env/bin/activate
Equivalently, on Windows, run:
env\Scripts\activate
Wonderful! You have a Python environment for Indeed web scraping.
Step #2: Choose the Right Scraping Library
The next step is to determine whether Indeed relies on dynamic or static pages. To do so, open the Indeed target page in incognito mode with your browser and start playing with it. As you can easily tell, most data on the page is loaded dynamically.
That is enough to say that you need a browser automation tool like Selenium to scrape Indeed effectively. For more guidance on this process, read our guide on Selenium web scraping.
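If you want to verify that conclusion programmatically, you can request the page with a plain HTTP client and check whether the job cards appear in the raw HTML. Below is a minimal sketch, assuming the requests package is installed. Keep in mind that Indeed may answer non-browser clients with a CAPTCHA page or a 403 error:
import requests

# Download the raw HTML of the search page, without executing JavaScript
response = requests.get(
    "https://www.indeed.com/jobs?q=data+scientist&l=New+York%2C+NY",
    headers={"User-Agent": "Mozilla/5.0"},
)

# If the job cards container is missing from the static HTML,
# the job data is rendered client-side by JavaScript
print("mosaic-provider-jobcards" in response.text)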
Selenium enables you to programmatically control a web browser to simulate user interactions and scrape content rendered by JavaScript. Time to install it and get started with it!
Step #3: Install and Configure Selenium
In an activated virtual environment, run the following command to install Selenium:
pip install -U selenium
Import Selenium in scraper.py and set up a WebDriver object:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# Set up a controllable Chrome instance
driver = webdriver.Chrome(service=Service())
The code above initializes what you need to control a Chrome instance.
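Optionally, you can customize the browser before launching it by passing a ChromeOptions object. A minimal sketch (the --start-maximized flag is just an example):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Configure Chrome before starting it
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")  # open the browser window maximized

driver = webdriver.Chrome(service=Service(), options=options)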
Note: Indeed has implemented anti-scraping measures to stop headless browsers from accessing its pages. Thus, setting the --headless flag would make your script fail. As an alternative approach, take a look at Playwright Stealth.
As the last line of your script, do not forget to close the web driver:
driver.quit()
Amazing! You are fully configured to scrape Indeed.
Step #4: Visit the Target Page
With the get() method from Selenium, instruct the controlled browser to visit the target page:
driver.get("https://www.indeed.com/jobs?q=data+scientist&l=New+York%2C+NY&from=searchOnHP%2Cwhatautocomplete&vjk=45d1ba700870fbef")
scraper.py will now contain the following lines of code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# Set up a controllable Chrome instance
driver = webdriver.Chrome(service=Service())
# Open the target page in the browser
driver.get("https://www.indeed.com/jobs?q=data+scientist&l=New+York%2C+NY&from=searchOnHP%2Cwhatautocomplete&vjk=45d1ba700870fbef")
# Scraping logic...
# Close the web driver
driver.quit()
Add a debugging breakpoint on the final line and run the script with the debugger. The controlled Chrome window will open and load the target Indeed page.
Note: The “Chrome is being controlled by automated test software.” notification tells you that Selenium is controlling Chrome as expected.
Well done!
Step #5: Select the Job Posting Elements
The Indeed job search page displays numerous job openings. Since we aim to scrape all of them, start by initializing an array to store the scraped data:
jobs = []
Next, inspect the HTML elements of the job openings on the page to understand how to select them.
Here, each job element is a slider_item node inside the #mosaic-provider-jobcards container.
Normally, you would use CSS classes to select elements on the page. However, these classes appear to be randomly generated, likely at build time. To ensure stability, it is better to target the id and data-testid attributes, which are less likely to change frequently.
Rely on Selenium to select the job elements:
jobs_container_element = driver.find_element(By.CSS_SELECTOR, "#mosaic-provider-jobcards")
job_elements = jobs_container_element.find_elements(By.CSS_SELECTOR, "[data-testid=\"slider_item\"]")
The find_elements() method applies the specified selector strategy to retrieve all matching elements from the page. In this case, the selector strategy is a CSS selector.
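Keep in mind that the job cards are rendered by JavaScript, so on a slow connection find_element() may run before they appear in the DOM. As an optional hardening step, you can wait for the container explicitly with Selenium’s WebDriverWait. A minimal sketch:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the job cards container to appear in the DOM
jobs_container_element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#mosaic-provider-jobcards"))
)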
Make sure to import By for this to work:
from selenium.webdriver.common.by import By
Now, iterate over the selected elements and prepare to scrape data from each one:
for job_element in job_elements:
# scrape data from each job opening
Fantastic! You are ready to start scraping job positions from Indeed.
Step #6: Scrape the Job Main Info
Inspect a card element, focusing on the information in the upper section of the card. There, you can scrape:
- The job title from the <h2>
- The job page URL from the <a> inside the title <h2>
- The company name from the [data-testid="company-name"] node
- The company location from the [data-testid="text-location"] element
Transform the information above into scraping logic as follows:
title_element = job_element.find_element(By.CSS_SELECTOR, "h2.jobTitle")
title = title_element.text
url_element = title_element.find_element(By.CSS_SELECTOR, "a")
url = url_element.get_attribute("href")
company_element = job_element.find_element(By.CSS_SELECTOR, "[data-testid=\"company-name\"]")
company = company_element.text
location_element = job_element.find_element(By.CSS_SELECTOR, "[data-testid=\"text-location\"]")
location = location_element.text
find_element() selects the first element matching the given selector. Given a node, you can then access its text content via the text attribute. To get the value of one of the node’s HTML attributes, use the get_attribute() method.
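For example, given a hypothetical <a href="/jobs/123">Data Scientist</a> node, the two accessors would behave as follows:
link_element = driver.find_element(By.CSS_SELECTOR, "a")
print(link_element.text)                   # "Data Scientist"
print(link_element.get_attribute("href"))  # the absolute URL resolved by the browser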
Cool! You have laid the groundwork for your Indeed scraping logic, but there is still useful data left to scrape.
Step #7: Scrape the Job Details
Focus on the details section of the job position card.
This time, the information to scrape is:
- The tags of the job position in one or more [data-testid="attribute_snippet_testid"] elements inside a .jobMetaDataGroup <div>
- Whether there is an option to apply easily through Indeed
- The description items in one or more ul li elements inside a [role="presentation"] <div>
Let’s start by targeting the tags. You can scrape them all with:
tags = []
tags_container_element = job_element.find_element(By.CSS_SELECTOR, ".jobMetaDataGroup")
tag_elements = tags_container_element.find_elements(By.CSS_SELECTOR, "[data-testid=\"attribute_snippet_testid\"]")
for tag_element in tag_elements:
tag = tag_element.text
tags.append(tag)
First, you need to initialize an array in which to store all the retrieved tags. That is required because a single job opening card can contain multiple tags. After selecting the tag elements, iterate over them, extract their text, and append each tag to the array.
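Equivalently, the same loop can be condensed into a list comprehension:
tags = [tag_element.text for tag_element in tag_elements]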
Scraping the “Easily apply” information is also tricky. The problem is that the HTML element indicating that option is not present in all job cards: it only appears where the “Easily apply” option is supported.
When you try to select an element that is not on the page, Selenium raises a NoSuchElementException. Thus, you can leverage that exception to implement the “Easily apply” check effectively:
try:
job_element.find_element(By.CSS_SELECTOR, "[data-testid=\"indeedApply\"]")
easily_apply = True
except NoSuchElementException:
easily_apply = False
If the [data-testid="indeedApply"] node is not on the page, Selenium will raise a NoSuchElementException. That exception gets intercepted, and easily_apply is set to False. For the except clause to work, remember to import the exception class:
from selenium.common import NoSuchElementException
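Alternatively, you can avoid the exception entirely: find_elements() returns an empty list when nothing matches, so checking whether the list is empty achieves the same result. A minimal equivalent sketch:
# True if at least one "Easily apply" indicator exists in the card
easily_apply = len(job_element.find_elements(By.CSS_SELECTOR, "[data-testid=\"indeedApply\"]")) > 0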
As for the description items, you can scrape them all as you did for the tags:
description = []
description_container_element = job_element.find_element(By.CSS_SELECTOR, "[role=\"presentation\"]")
description_elements = description_container_element.find_elements(By.CSS_SELECTOR, "ul li")
for description_element in description_elements:
description_item_text = description_element.text
# Ignore empty description strings
    if description_item_text != "":
description.append(description_item_text)
Wow! The Indeed scraper is almost complete.
Step #8: Collect the Scraped Data
With the scraped data from each job position, populate a job dictionary:
job = {
"title": title,
"url": url,
"company": company,
"location": location,
"tags": tags,
"easily_apply": easily_apply,
"description": description
}
Then, add it to the jobs array:
jobs.append(job)
At the end of the for loop, jobs should contain something like:
[{'title': 'Data Scientist', 'url': 'https://www.indeed.com/rc/clk?jk=efc7b7f4a8be2882&bb=NM368jsOPyYGAfEtQk2NNae8tSeBHdJ8Y9tImVa1Q9GAipGe0zzddcUozFEL0Na_pYCR4W6ljgljsBxWTUrluVuL8Gom7x7UZlgMzs0spo3NRgisrZ7meuaPfaEcjWoe&xkcb=SoD767M34WNyEaSTwx0FbzkdCdPP&fccid=8678bc4e64c24580&vjs=3', 'company': 'GQR', 'location': 'New York, NY', 'tags': [], 'easily_apply': False, 'description': ['Stay current with industry trends and emerging technologies to ensure competitive edge.', 'Apply statistical and machine learning techniques to improve investment…']},
# omitted for brevity...
{'title': 'Data Scientist, Financial Crimes - USDS', 'url': 'https://www.indeed.com/rc/clk?jk=aaa16dfd1cc6ef01&bb=NM368jsOPyYGAfEtQk2NNdxizAZQnHpzRrlr6WgbV1RtxmXz4vto1qiiqGiIj9CJFQQCV6cW59nE4hGw1yeNdokPfu8Fgl3EALBx5zdWjPm4COEu78DCFh4KTUMIFWkh&xkcb=SoAT67M34WNyEaSTwx0pbzkdCdPP&fccid=caed318a9335aac0&vjs=3', 'company': 'TikTok', 'location': 'Hybrid work in New York, NY', 'tags': [], 'easily_apply': False, 'description': ['As a Financial Crime Data Scientist, you will play a crucial role in leveraging machine learning, analytics and visualization techniques to enhance our…']}]
Marvelous! You only have to convert this data to a better format.
Step #9: Export the Scraped Data to CSV
To make the scraped data accessible and shareable, it is a good idea to export it to a human-readable format, such as a CSV file. To do so, use these lines of code:
csv_file = "jobs.csv"
csv_headers = ["title", "url", "company", "location", "tags", "easily_apply", "description"]
with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
writer = csv.DictWriter(file, fieldnames=csv_headers)
writer.writeheader()
for job in jobs:
writer.writerow({
"title": job["title"],
"url": job["url"],
"company": job["company"],
"location": job["location"],
"tags": ";".join(job["tags"]),
"easily_apply": "Yes" if job["easily_apply"] else "No",
"description": ";".join(job["description"])
})
The open() function creates the output CSV file, which is then populated with csv.DictWriter. Since the tags and description fields are arrays, join() is used to flatten them into a single string with elements separated by ;.
Do not forget to import csv from the Python Standard Library:
import csv
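If you would rather keep the tags and description arrays intact instead of flattening them, you could export the data to JSON with the standard json module. A minimal sketch (the scraped_jobs.json file name is just an example):
import json

# Serialize the list of job dictionaries to a JSON file
with open("scraped_jobs.json", mode="w", encoding="utf-8") as file:
    json.dump(jobs, file, ensure_ascii=False, indent=2)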
Here we go! The Indeed scraper is complete.
Step #10: Put It All Together
Your final scraper.py file will now contain:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.common import NoSuchElementException
import csv
# Set up a controllable Chrome instance
driver = webdriver.Chrome(service=Service())
# Open the target page in the browser
driver.get("https://www.indeed.com/jobs?q=data+scientist&l=New+York%2C+NY&from=searchOnDesktopSerp")
# A data structure where to store the scraped job openings
jobs = []
# Select the job opening elements on the page
jobs_container_element = driver.find_element(By.CSS_SELECTOR, "#mosaic-provider-jobcards")
job_elements = jobs_container_element.find_elements(By.CSS_SELECTOR, "[data-testid=\"slider_item\"]")
# Scrape each job opening on the page
for job_element in job_elements:
title_element = job_element.find_element(By.CSS_SELECTOR, "h2.jobTitle")
title = title_element.text
url_element = title_element.find_element(By.CSS_SELECTOR, "a")
url = url_element.get_attribute("href")
    company_element = job_element.find_element(By.CSS_SELECTOR, "[data-testid=\"company-name\"]")
company = company_element.text
location_element = job_element.find_element(By.CSS_SELECTOR, "[data-testid=\"text-location\"]")
location = location_element.text
tags = []
tags_container_element = job_element.find_element(By.CSS_SELECTOR, ".jobMetaDataGroup")
tag_elements = tags_container_element.find_elements(By.CSS_SELECTOR, "[data-testid=\"attribute_snippet_testid\"]")
for tag_element in tag_elements:
tag = tag_element.text
tags.append(tag)
# Check whether the "Easy Apply" element is on the page
try:
job_element.find_element(By.CSS_SELECTOR, "[data-testid=\"indeedApply\"]")
easily_apply = True
except NoSuchElementException:
easily_apply = False
description = []
description_container_element = job_element.find_element(By.CSS_SELECTOR, "[role=\"presentation\"]")
description_elements = description_container_element.find_elements(By.CSS_SELECTOR, "ul li")
for description_element in description_elements:
description_item_text = description_element.text
# Ignore empty description strings
        if description_item_text != "":
description.append(description_item_text)
# Store the scraped data
job = {
"title": title,
"url": url,
"company": company,
"location": location,
"tags": tags,
"easily_apply": easily_apply,
"description": description
}
jobs.append(job)
# Export the scraped data to an output CSV file
csv_file = "jobs.csv"
csv_headers = ["title", "url", "company", "location", "tags", "easily_apply", "description"]
with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
writer = csv.DictWriter(file, fieldnames=csv_headers)
writer.writeheader()
for job in jobs:
writer.writerow({
"title": job["title"],
"url": job["url"],
"company": job["company"],
"location": job["location"],
"tags": ";".join(job["tags"]),
"easily_apply": "Yes" if job["easily_apply"] else "No",
"description": ";".join(job["description"])
})
# Close the web driver
driver.quit()
In less than 100 lines of code, you just built an Indeed scraper in Python!
Launch the scraper with the following command:
python3 scraper.py
Or, on Windows:
python scraper.py
A jobs.csv file will appear in your project’s folder. Open it, and you will see the scraped job data in tabular form.
Et voilà! Mission complete.
Unlock Indeed Data With Ease
Indeed is well aware of the value of its data and employs robust measures to protect it. This is why, when interacting with its pages using a browser automation tool like Selenium, you are likely to encounter a CAPTCHA.
As a first step, consider following our guide on how to bypass CAPTCHAs in Python. Nevertheless, be aware that the site might still block your attempts with additional anti-bot measures. Discover them all in our webinar on anti-bot techniques.
These challenges highlight how scraping Indeed without the proper tools can quickly become frustrating and inefficient. Moreover, the inability to use headless browsers makes your scraping script slower and more resource-intensive.
The solution? Bright Data’s Indeed Scraper API, which lets you retrieve data from Indeed seamlessly through simple API calls: no CAPTCHAs, no blocks, and no hassle!
Conclusion
In this step-by-step guide, you learned what an Indeed scraper is, the types of data it can retrieve, and how to build one in Python. In just around 100 lines of code, you created a script that automatically collects data from Indeed.
Still, scraping Indeed comes with its challenges. The platform enforces strict anti-bot measures, including CAPTCHAs. These are difficult to bypass and can slow down your scraping process, making it less efficient. Forget about all those challenges with our Indeed Scraper API.
If web scraping is not your thing but you are still interested in job openings data, explore our ready-to-use Indeed datasets!
Create a free Bright Data account today to try our scraper APIs or explore our datasets.
No credit card required