Follow this step-by-step tutorial and learn how to build a web scraping Indeed Python script to automatically retrieve data about job openings.
This guide will cover:
- Why scrape jobs data from the web?
- Libraries and tools for scraping Indeed
- Scraping jobs data from Indeed with Selenium
Why Scrape Jobs Data From the Web?
Scraping jobs data from the web is useful for several reasons, including:
- Market research: Allows businesses and job market analysts to gather information on industry trends, such as which skills are in high demand or which geographic regions are experiencing job growth. It also enables you to monitor competitors’ hiring activities.
- Streamlining job search and matching: Helps job seekers search job listings from multiple sources to find positions that match their qualifications and preferences.
- Recruitment and HR optimization: Supports the recruitment process by facilitating hiring and helping you understand market salary trends and the benefits candidates are looking for.
Thus, job data is useful for both employers and job seekers.
When it comes to scraping job listings, there is one essential aspect to stress: the target platform needs to be public. In other words, it must allow even non-logged-in users to perform job searches. This is because scraping data behind a login wall can get you into legal trouble.
That means taking LinkedIn out of the equation. What other job platforms remain? Indeed, one of the leading online job platforms!
Libraries and Tools for Scraping Indeed
Python is considered one of the best languages for scraping thanks to its syntax, ease of use, and rich ecosystem of libraries. So, let’s go for it. Check out our guide on web scraping with Python.
You now need to choose the right scraping libraries out of the many available. To make an informed decision, explore Indeed in your browser. You will notice that most of the data on the site is retrieved after interaction. This means that the site relies heavily on AJAX to load and update content dynamically without full page reloads. To scrape such a site, you need a tool that can run JavaScript. That tool is Selenium!
Selenium makes it possible to scrape dynamic websites in Python. It renders sites in a controllable web browser, performing operations as you instruct it. Thanks to Selenium, you can scrape data even if the target site uses JavaScript for rendering or data retrieval.
Learn how to scrape job postings off websites like Indeed!
Scraping Jobs Data From Indeed With Selenium
Follow this step-by-step tutorial and see how to build a web scraping Indeed Python script.
Step 1: Project setup
Before web scraping jobs, make sure you meet these prerequisites:
- Python 3+ installed on your machine: Download the installer, double-click on it, and follow the installation wizard.
- A Python IDE of your choice: PyCharm Community Edition or Visual Studio Code with the Python extension are two great choices.
You now have everything you need to set up a Python project!
Open the terminal and launch the following commands to:
- Create an indeed-scraper folder
- Enter it
- Initialize it with a Python virtual environment
mkdir indeed-scraper
cd indeed-scraper
python -m venv env
On Linux or macOS, run the command below to activate the environment:
source env/bin/activate
While on Windows, execute:
env\Scripts\activate.ps1
Next, initialize a scraper.py file containing the line below in the project folder:
print("Hello, World!")
Right now, it only prints “Hello, World!” but it will soon contain the Indeed scraping logic.
Launch it to verify that it works with:
python scraper.py
If all went as planned, it should print this message in the terminal:
Hello, World!
Now that you know that the script works, open the project folder in your Python IDE.
Well done! Get ready to write some Python code!
Step 2: Install the scraping libraries
As mentioned earlier, Selenium is a great tool when it comes to web scraping job postings from Indeed. Run the command below in the activated Python virtual environment to add it to the project’s dependencies:
pip install selenium
This might take a while, so be patient.
Please note that this tutorial refers to Selenium 4.11.2, which comes with automatic driver detection capabilities. If you have an older version of Selenium installed on your PC, update it with:
pip install selenium -U
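To double-check which version is actually installed in your environment, you can print it from Python (a quick sanity check; the selenium package exposes its version via the __version__ attribute):
python -c "import selenium; print(selenium.__version__)"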
Now, clear scraper.py. Then, import the package and initialize a Selenium scraper with:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# set up a controllable Chrome instance
# in headless mode
service = Service()
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(
    service=service,
    options=options
)
# scraping logic...
# close the browser and free up the resources
driver.quit()
This script creates a WebDriver instance to programmatically control a Chrome window. The browser will be opened behind the scenes in headless mode, which means with no GUI. That is a common setup for production. If you instead prefer to follow the operations performed by the script on the page, comment out that option. This is useful in development.
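If you find yourself switching between the two modes often, a minimal sketch like the one below can make the toggle explicit. Note that the HEADLESS flag is a name introduced here for illustration, not part of the Selenium API:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# hypothetical flag: True for production, False to watch the browser while developing
HEADLESS = True

options = webdriver.ChromeOptions()
if HEADLESS:
    options.add_argument("--headless=new")
driver = webdriver.Chrome(service=Service(), options=options)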
Make sure that your Python IDE does not report any errors. Ignore the warnings you may receive because of the unused imports. You are about to use the library to extract job data from Indeed!
Perfect! Time to build your web scraping Indeed Python scraper.
Step 3: Connect to the target web page
Open Indeed and search for jobs you are interested in. In this guide, you will see how to scrape remote job postings for software engineers in New York. Keep in mind that any other Indeed job search will do. The scraping logic will be the same.
Here is what the target page looks like in the browser as of this writing:
Specifically, this is what the URL of the target page looks like:
https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY&sc=0kf%3Aattr%28DSQF7%29%3B&radius=100
As you can see, it is a dynamic URL that changes based on some query parameters.
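For instance, here is one way you might build that URL programmatically with urllib.parse.urlencode from the Python Standard Library (a sketch; the parameter values are simply those visible in the URL above):
from urllib.parse import urlencode

params = {
    "q": "software engineer",  # search keywords
    "l": "New York, NY",       # location
    "sc": "0kf:attr(DSQF7);",  # Indeed's filter for remote jobs
    "radius": "100",           # search radius in miles
}
url = f"https://www.indeed.com/jobs?{urlencode(params)}"
print(url)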
You can then use Selenium to connect to the target page with:
driver.get("https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY&sc=0kf%3Aattr%28DSQF7%29%3B&radius=100")
The get() function instructs the browser to visit the page specified by the URL passed as a parameter.
After opening the page, you should set the window size to ensure that every element is visible:
driver.set_window_size(1920, 1080)
This is what your scraping Indeed script looks like so far:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# set up a controllable Chrome instance
# in headless mode
service = Service()
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(
    service=service,
    options=options
)
# set the window size to make sure pages
# will not be rendered in responsive mode
driver.set_window_size(1920, 1080)
# open the target page in the browser
driver.get("https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY&sc=0kf%3Aattr%28DSQF7%29%3B&radius=100")
# scraping logic...
# close the browser and free up the resources
driver.quit()
Comment out the headless mode option and launch the script. It will open the window below for a fraction of a second before closing:
Note the “Chrome is being controlled by automated software” disclaimer. That confirms Selenium is controlling the browser as expected.
Step 4: Familiarize yourself with the page structure
Before diving into scraping, there is another crucial step to carry out. Scraping data from a site involves selecting HTML elements and extracting data from them. Finding a way to get the desired nodes from the DOM is not always easy. That is why you should spend some time analyzing the page structure to understand how to define an effective selection strategy.
Open your browser and visit the Indeed job search page. Right-click on any element and select the “Inspect” option to open the DevTools of your browser:
Here, you will see that most elements containing interesting data have CSS classes such as:
css-j45z4f, css-1m4cuuf, …
e37uo190, eu4oa1w0, …
job_f27ade40cc1a3686, job_1a53a17f1faeae92, …
Since these appear to be randomly generated at build time, you should not rely on them for scraping. Instead, you should base the selection logic on classes such as:
jobsearch-JobInfoHeader-title
date
cardOutline
Or IDs like:
companyRatings
applyButtonLinkContainer
jobDetailsSection
Also, note that some nodes have unique HTML attributes:
data-company-name
data-testid
That is useful information to keep in mind for web scraping jobs from Indeed. Interact with the page to study how it reacts and what data it shows. You will also realize that different job openings expose different info attributes. The sketch below makes this selection strategy concrete.
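Here is a short sketch contrasting a fragile selector with more robust alternatives based on the stable hooks listed above (it assumes the driver instance created in step 2 and the By import introduced in step 5):
from selenium.webdriver.common.by import By

# fragile: auto-generated class names like this may change at any deploy
# driver.find_element(By.CSS_SELECTOR, ".css-1m4cuuf")

# robust: semantic classes, IDs, and data-* attributes
card = driver.find_element(By.CSS_SELECTOR, ".cardOutline")
details = driver.find_element(By.ID, "jobDetailsSection")
company = driver.find_element(By.CSS_SELECTOR, "[data-company-name='true']")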
Keep inspecting the target site and familiarize yourself with its DOM structure until you feel ready to move on.
Step 5: Start extracting the job data
A single Indeed search page contains several job openings, so you need an array to keep track of the jobs scraped from the page:
jobs = []
As you must have noticed in the previous step, the job postings are shown in .cardOutline cards:
Select them all with:
job_cards = driver.find_elements(By.CSS_SELECTOR, ".cardOutline")
The find_elements() method from Selenium allows you to locate web elements on a web page. Similarly, there is also the find_element() method to get the first node that matches the selection query.
By.CSS_SELECTOR instructs the driver to use a CSS selector strategy. Selenium also supports:
- By.ID: To search for an element by the id HTML attribute
- By.TAG_NAME: To search for elements based on their HTML tag
- By.XPATH: To search for elements via an XPath expression
Import By with:
from selenium.webdriver.common.by import By
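As a quick illustration, here is how those strategies compare side by side (a sketch reusing hooks identified in step 4; it assumes the driver from step 2):
cards = driver.find_elements(By.CSS_SELECTOR, ".cardOutline")  # CSS selector strategy
details = driver.find_element(By.ID, "jobDetailsSection")      # id attribute strategy
items = details.find_elements(By.TAG_NAME, "li")               # HTML tag strategy
divs = driver.find_elements(By.XPATH, "//div[@id='jobDetailsSection']/div")  # XPath strategy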
Iterate over the list of job cards, initializing a Python dictionary in which to store the job details:
for job_card in job_cards:
    # initialize a dictionary to store the scraped job data
    job = {}

    # job data extraction logic...
A job posting can have several attributes. Since only a small portion of them are mandatory, initialize a list of variables with default values right away:
posted_at = None
applications = None
title = None
company_name = None
company_rating = None
company_reviews = None
location = None
location_type = None
apply_link = None
pay = None
job_type = None
benefits = None
description = None
Now that you are familiar with the page, you know that some details are in the outline job card. Others are instead in the details tab that shows up upon interaction.
For example, the creation date and the number of applications are in the summary tab:
Extract them both with:
try:
    date_element = job_card.find_element(By.CSS_SELECTOR, ".date")
    date_element_text = date_element.text

    posted_at_text = date_element_text
    if "•" in date_element_text:
        date_element_text_array = date_element_text.split("•")
        posted_at_text = date_element_text_array[0]
        applications = date_element_text_array[1] \
            .replace("applications", "") \
            .replace("in progress", "") \
            .strip()

    posted_at = posted_at_text \
        .replace("Posted", "") \
        .replace("Employer", "") \
        .replace("Active", "") \
        .strip()
except NoSuchElementException:
    pass
This snippet highlights some patterns that are key to web scraping job postings from Indeed. As most info elements are optional, you must protect against the following error:
selenium.common.exceptions.NoSuchElementException: Message: no such element
Selenium throws it when trying to select an HTML element that is not currently on the page.
Import the exception with:
from selenium.common import NoSuchElementException
The try ... except instruction ensures that if the target element is not in the DOM, the script will continue without failure.
Also, some job information is contained in strings like:
<info_1> • <info_2>
If <info_2>
is missing, the string format is instead:
<info_1>
Thus, you need to change the data extraction logic based on the presence of the "•" character.
Given an HTML element, you can access its text content with the text attribute. Use the replace() Python string method to clean the collected strings.
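If you prefer, you can also wrap that "•" splitting logic into a reusable function. Again, split_on_bullet is a hypothetical helper shown for illustration:
def split_on_bullet(text):
    # split "<info_1> • <info_2>" into its two parts;
    # the second part is None when the "•" separator is missing
    if "•" in text:
        first, second = text.split("•", 1)
        return first.strip(), second.strip()
    return text.strip(), None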
Step 6: Deal with Indeed anti-scraping measures
Indeed adopts some techniques and technologies to prevent bots from accessing its data. For example, when you interact with the job cards, it tends to open this modal from time to time:
This popup blocks interaction. If not properly addressed, it will stop your Selenium Indeed script. Inspect it in the DevTools and pay attention to the close button:
Close this modal in Selenium with:
try:
    dialog_element = driver.find_element(By.CSS_SELECTOR, "[role=dialog]")
    close_button = dialog_element.find_element(By.CSS_SELECTOR, ".icl-CloseButton")
    close_button.click()
except NoSuchElementException:
    pass
The click() method from Selenium enables you to click on the selected element in the controlled browser.
Great! This will close the popup and let you continue the interaction.
Another bot protection technology to seriously take into account is Cloudflare. When you interact too much with the page and produce too many requests, Indeed will show you this anti-bot screen:
Solving Cloudflare CAPTCHAs from Selenium is a very challenging task that requires a premium product. Scraping Indeed is not that easy, after all. Fortunately, you can avoid them by introducing some random delays in your script.
Make sure the last operation in your for loop is:
time.sleep(random.uniform(1, 5))
This will pause the script for a random number of seconds between 1 and 5.
Import the required packages from the Python Standard Library with:
import random
import time
Way to go! Nothing will stop your automated script from scraping Indeed.
Step 7: Open the job details card
When you click on an outline job card, Indeed performs an AJAX call to retrieve the details on the fly. While waiting for this data, the page shows an animated placeholder:
You can verify that the details section has been loaded when the element below is on the page:
So, to get access to the job details data in Selenium you have to:
- Perform the click operation
- Wait for the page to contain the data of interest
Achieve that with:
job_card.click()
try:
    title_element = WebDriverWait(driver, 5) \
        .until(EC.presence_of_element_located((By.CSS_SELECTOR, ".jobsearch-JobInfoHeader-title")))
    title = title_element.text.replace("\n- job post", "")
except TimeoutException:
    continue
The WebDriverWait object from Selenium allows you to wait for a specific condition to occur. In this case, the script waits up to 5 seconds for .jobsearch-JobInfoHeader-title to be on the page. After that, it throws a TimeoutException, which is why the except clause above catches that exception rather than NoSuchElementException.
Note that the above snippet also retrieves the title of the job opening.
Import WebDriverWait, EC, and TimeoutException:
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common import TimeoutException
From now on, the element to focus on is this detail column:
Select it with:
job_details_element = driver.find_element(By.CSS_SELECTOR, ".jobsearch-RightPane")
Fantastic! You are all set to scrape some job data!
Step 8: Extract the job details
Time to populate the variables defined in step 5 with some job data.
Get the name of the company behind the job opening:
try:
    company_link_element = job_details_element.find_element(By.CSS_SELECTOR, "div[data-company-name='true'] a")
    company_name = company_link_element.text
except NoSuchElementException:
    pass
Then, extract information on the company’s user ratings and number of reviews:
As you can see, there is not an easy way to access the element storing the number of reviews.
try:
    company_rating_element = job_details_element.find_element(By.ID, "companyRatings")
    company_rating = company_rating_element.get_attribute("aria-label").split("out")[0].strip()

    company_reviews_element = job_details_element.find_element(By.CSS_SELECTOR, "[data-testid='inlineHeader-companyReviewLink']")
    company_reviews = company_reviews_element.text.replace(" reviews", "")
except NoSuchElementException:
    pass
Next, focus on the company location:
Again, you need to apply the "•" pattern mentioned in step 5:
try:
    company_location_element = job_details_element.find_element(By.CSS_SELECTOR, "[data-testid='inlineHeader-companyLocation']")
    company_location_element_text = company_location_element.text

    location = company_location_element_text
    if "•" in company_location_element_text:
        company_location_element_text_array = company_location_element_text.split("•")
        location = company_location_element_text_array[0]
        location_type = company_location_element_text_array[1]
except NoSuchElementException:
    pass
Since you may want to quickly apply for the job, take a look at the Indeed “Apply on company site” button as well:
Retrieve the button’s target URL with:
try:
    apply_link_element = job_details_element.find_element(By.CSS_SELECTOR, "#applyButtonLinkContainer button")
    apply_link = apply_link_element.get_attribute("href")
except NoSuchElementException:
    pass
The get_attribute() method from Selenium returns the value of the specified HTML attribute.
Now, the tricky part begins.
If you inspect the “Job details” section, you will notice that there is not an easy way to select the pay and job type elements:
What you can do is:
- Get all <div>s inside the “Job details” <div>
- Iterate over them
- If the current <div>’s text contains “Pay” or “Job Type,” get the next sibling
- Extract the data of interest
In other words, you have to implement the logic below:
for div in job_details_element.find_elements(By.CSS_SELECTOR, "#jobDetailsSection div"):
    if div.text == "Pay":
        pay_element = div.find_element(By.XPATH, "following-sibling::*")
        pay = pay_element.text
    elif div.text == "Job Type":
        job_type_element = div.find_element(By.XPATH, "following-sibling::*")
        job_type = job_type_element.text
Selenium does not provide a utility method for accessing the siblings of a node. What you can do instead is use the following-sibling::* XPath expression.
Now, focus on the job’s benefits. Usually, there is more than one:
To retrieve them all, you need to initialize a list and populate it with:
try:
    benefits_element = job_details_element.find_element(By.ID, "benefits")

    benefits = []
    for benefit_element in benefits_element.find_elements(By.TAG_NAME, "li"):
        benefit = benefit_element.text
        benefits.append(benefit)
except NoSuchElementException:
    pass
Finally, get the raw job description:
Extract the text of the description with:
try:
    description_element = job_details_element.find_element(By.ID, "jobDescriptionText")
    description = description_element.text
except NoSuchElementException:
    pass
Populate the job dictionary and add it to the jobs list:
job["posted_at"] = posted_at
job["applications"] = applications
job["title"] = title
job["company_name"] = company_name
job["company_rating"] = company_rating
job["company_reviews"] = company_reviews
job["location"] = location
job["location_type"] = location_type
job["apply_link"] = apply_link
job["pay"] = pay
job["job_type"] = job_type
job["benefits"] = benefits
job["description"] = description
jobs.append(job)
You can also add a log instruction to verify that the script works as expected:
print(job)
Run the script:
python scraper.py
This will produce an output similar to:
{'posted_at': '17 days ago', 'applications': '50+', 'title': 'Software Support Engineer', 'company_name': 'Integrated DNA Technologies (IDT)', 'company_rating': '3.5', 'company_reviews': '95', 'location': 'New York, NY 10001', 'location_type': 'Remote', 'apply_link': 'https://www.indeed.com/applystart?jk=c00120130a9c933b&from=vj&pos=bottom&mvj=0&jobsearchTk=1h9fpft0fj3t3800&spon=0&sjdu=YmZE5d5THV8u75cuc0H6Y26AwfY51UOGmh3Z9h4OvXiYhWlsa56nLum9aT96NeA9XAwdulcUk0atwlDdDDqlBQ&vjfrom=tp-semfirstjob&astse=bcf3778ad128bc26&assa=2447', 'pay': '$80,000 - $100,000 a year', 'job_type': 'Full-time', 'benefits': ['401(k)', '401(k) matching', 'Dental insurance', 'Health insurance', 'Paid parental leave', 'Paid time off', 'Parental leave', 'Vision insurance'], 'description': "Integrated DNA Technologies (IDT) is the leading manufacturer of custom oligonucleotides and proprietary technologies for (omitted for brevity...)"}
Et voilà! You just learned how to scrape job postings off websites.
Step 9: Scrape multiple job opening pages
A typical job search on Indeed produces a paginated list with dozens of results. See how to scrape each page!
First, inspect a page and note how Indeed behaves. In detail, it shows the following element when there is a next page available.
Otherwise, the next page element is missing:
Keep in mind that Indeed may return a list with hundreds of job openings. Since you do not want your script to run forever, consider adding a limit to the number of pages scraped.
Implement web crawling on Indeed in Selenium with:
pages_scraped = 0
pages_to_scrape = 5

while pages_scraped < pages_to_scrape:
    job_cards = driver.find_elements(By.CSS_SELECTOR, ".cardOutline")

    for job_card in job_cards:
        # scraping logic...

    pages_scraped += 1

    # if this is not the last page, go to the next page
    # otherwise, break the while loop
    try:
        next_page_element = driver.find_element(By.CSS_SELECTOR, "a[data-testid=pagination-page-next]")
        next_page_element.click()
    except NoSuchElementException:
        break
The Indeed scraper will now keep looping until it reaches the last page or goes through 5 pages.
Step 10: Export scraped data to JSON
Right now, the scraped data is stored in a list of Python dictionaries. Export it to JSON to make it easier to share and read.
First, create an output object:
output = {
    "date": datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    "jobs": jobs
}
The date attribute is required because the job opening publication dates are in the format “<X> days ago.” Without some context on the day the jobs data was scraped, it would be difficult to interpret them. Remember to import datetime:
from datetime import datetime
Then, export it with:
import json
# scraping logic...
with open("jobs.json", "w") as file:
    json.dump(output, file, indent=4)
The above snippet initializes a jobs.json output file with open() and populates it with JSON data via json.dump(). Check out our article to learn more about how to parse and serialize data to JSON in Python.
The json package comes from the Python Standard Library, so you do not even need to install an extra dependency to achieve the objective.
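If you want to verify the export programmatically, you can load the file back with json.load (a quick sanity check, assuming jobs.json has just been written):
import json

with open("jobs.json") as file:
    data = json.load(file)
print(f"Scraped {len(data['jobs'])} jobs on {data['date']}")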
Wow! You started from raw job data contained in a webpage and now have semi-structured JSON data. You are ready to take a look at the entire web scraping Indeed Python script.
Step 11: Put it all together
Here is the complete scraper.py file:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.common import NoSuchElementException, TimeoutException
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import random
import time
from datetime import datetime
import json

# set up a controllable Chrome instance
# in headless mode
service = Service()
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(
    service=service,
    options=options
)

# open the target page in the browser
driver.get("https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY&sc=0kf%3Aattr%28DSQF7%29%3B&radius=100")

# set the window size to make sure pages
# will not be rendered in responsive mode
driver.set_window_size(1920, 1080)

# a data structure where to store the job openings
# scraped from the page
jobs = []

pages_scraped = 0
pages_to_scrape = 3

while pages_scraped < pages_to_scrape:
    # select the job posting cards on the page
    job_cards = driver.find_elements(By.CSS_SELECTOR, ".cardOutline")

    for job_card in job_cards:
        # initialize a dictionary to store the scraped job data
        job = {}

        # initialize the job attributes to scrape
        posted_at = None
        applications = None
        title = None
        company_name = None
        company_rating = None
        company_reviews = None
        location = None
        location_type = None
        apply_link = None
        pay = None
        job_type = None
        benefits = None
        description = None

        # get the general job data from the outline card
        try:
            date_element = job_card.find_element(By.CSS_SELECTOR, ".date")
            date_element_text = date_element.text

            posted_at_text = date_element_text
            if "•" in date_element_text:
                date_element_text_array = date_element_text.split("•")
                posted_at_text = date_element_text_array[0]
                applications = date_element_text_array[1] \
                    .replace("applications", "") \
                    .replace("in progress", "") \
                    .strip()

            posted_at = posted_at_text \
                .replace("Posted", "") \
                .replace("Employer", "") \
                .replace("Active", "") \
                .strip()
        except NoSuchElementException:
            pass

        # close the anti-scraping modal
        try:
            dialog_element = driver.find_element(By.CSS_SELECTOR, "[role=dialog]")
            close_button = dialog_element.find_element(By.CSS_SELECTOR, ".icl-CloseButton")
            close_button.click()
        except NoSuchElementException:
            pass

        # load the job details card
        job_card.click()

        # wait for the job details section to load after the click
        try:
            title_element = WebDriverWait(driver, 5) \
                .until(EC.presence_of_element_located((By.CSS_SELECTOR, ".jobsearch-JobInfoHeader-title")))
            title = title_element.text.replace("\n- job post", "")
        except TimeoutException:
            continue

        # extract the job details
        job_details_element = driver.find_element(By.CSS_SELECTOR, ".jobsearch-RightPane")

        try:
            company_link_element = job_details_element.find_element(By.CSS_SELECTOR, "div[data-company-name='true'] a")
            company_name = company_link_element.text
        except NoSuchElementException:
            pass

        try:
            company_rating_element = job_details_element.find_element(By.ID, "companyRatings")
            company_rating = company_rating_element.get_attribute("aria-label").split("out")[0].strip()

            company_reviews_element = job_details_element.find_element(By.CSS_SELECTOR, "[data-testid='inlineHeader-companyReviewLink']")
            company_reviews = company_reviews_element.text.replace(" reviews", "")
        except NoSuchElementException:
            pass

        try:
            company_location_element = job_details_element.find_element(By.CSS_SELECTOR, "[data-testid='inlineHeader-companyLocation']")
            company_location_element_text = company_location_element.text

            location = company_location_element_text
            if "•" in company_location_element_text:
                company_location_element_text_array = company_location_element_text.split("•")
                location = company_location_element_text_array[0]
                location_type = company_location_element_text_array[1]
        except NoSuchElementException:
            pass

        try:
            apply_link_element = job_details_element.find_element(By.CSS_SELECTOR, "#applyButtonLinkContainer button")
            apply_link = apply_link_element.get_attribute("href")
        except NoSuchElementException:
            pass

        for div in job_details_element.find_elements(By.CSS_SELECTOR, "#jobDetailsSection div"):
            if div.text == "Pay":
                pay_element = div.find_element(By.XPATH, "following-sibling::*")
                pay = pay_element.text
            elif div.text == "Job Type":
                job_type_element = div.find_element(By.XPATH, "following-sibling::*")
                job_type = job_type_element.text

        try:
            benefits_element = job_details_element.find_element(By.ID, "benefits")

            benefits = []
            for benefit_element in benefits_element.find_elements(By.TAG_NAME, "li"):
                benefit = benefit_element.text
                benefits.append(benefit)
        except NoSuchElementException:
            pass

        try:
            description_element = job_details_element.find_element(By.ID, "jobDescriptionText")
            description = description_element.text
        except NoSuchElementException:
            pass

        # store the scraped data
        job["posted_at"] = posted_at
        job["applications"] = applications
        job["title"] = title
        job["company_name"] = company_name
        job["company_rating"] = company_rating
        job["company_reviews"] = company_reviews
        job["location"] = location
        job["location_type"] = location_type
        job["apply_link"] = apply_link
        job["pay"] = pay
        job["job_type"] = job_type
        job["benefits"] = benefits
        job["description"] = description
        jobs.append(job)

        # wait for a random number of seconds from 1 to 5
        # to avoid rate limiting blocks
        time.sleep(random.uniform(1, 5))

    # increment the scraping counter
    pages_scraped += 1

    # if this is not the last page, go to the next page
    # otherwise, break the while loop
    try:
        next_page_element = driver.find_element(By.CSS_SELECTOR, "a[data-testid=pagination-page-next]")
        next_page_element.click()
    except NoSuchElementException:
        break

# close the browser and free up the resources
driver.quit()

# produce the output object
output = {
    "date": datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    "jobs": jobs
}

# export it to JSON
with open("jobs.json", "w") as file:
    json.dump(output, file, indent=4)
In less than 200 lines of code, you just built a fully-featured web scraper to scrape jobs data from Indeed.
Launch it with:
python scraper.py
Wait a few minutes for the script to complete.
At the end of the scraping process, a jobs.json file will appear in the root folder of your project. Open it and you will see:
{
    "date": "2023-09-02 19:56:44",
    "jobs": [
        {
            "posted_at": "7 days ago",
            "applications": "50+",
            "title": "Software Engineer - All Levels",
            "company_name": "Listrak",
            "company_rating": "3",
            "company_reviews": "5",
            "location": "King of Prussia, PA",
            "location_type": "Remote",
            "apply_link": "https://www.indeed.com/applystart?jk=f27ade40cc1a3686&from=vj&pos=bottom&mvj=0&jobsearchTk=1h9bge7mbhdj0800&spon=0&sjdu=YmZE5d5THV8u75cuc0H6Y26AwfY51UOGmh3Z9h4OvXgPYWebWpM-4nO05Ssl8I8z-BhdrQogdzP3xc9-PmOQTQ&vjfrom=vjs&astse=16430083478063d1&assa=2381",
            "pay": null,
            "job_type": null,
            "benefits": [
                "Gym membership",
                "Paid time off"
            ],
            "description": "About Listrak:\nWe are a SaaS company that offers an integrated digital marketing platform trusted by 1,000+ leading retailers and brands for email, text message marketing, identity resolution, behavioral triggers and cross-channel orchestration. Our HQ is in (omitted for brevity...)"
        },
        // omitted for brevity...
        {
            "posted_at": "9 days ago",
            "applications": null,
            "title": "Software Engineer, Front End (Hybrid-Remote)",
            "company_name": "Weill Cornell Medicine",
            "company_rating": "3.4",
            "company_reviews": "41",
            "location": "New York, NY 10021",
            "location_type": "Remote",
            "apply_link": "https://www.indeed.com/applystart?jk=1a53a17f1faeae92&from=vj&pos=bottom&mvj=0&jobsearchTk=1h9bge7mbhdj0800&spon=0&sjdu=YmZE5d5THV8u75cuc0H6Y26AwfY51UOGmh3Z9h4OvXgZADiLYj9Y4htcvtDy_iaWMIfcMu539kP3i1FMxIq2rA&vjfrom=vjs&astse=90a9325429efdf13&assa=4615",
            "pay": "$99,800 - $123,200 a year",
            "job_type": null,
            "benefits": null,
            "description": "Title: Software Engineer, Front End (Hybrid-Remote)\nTitle: Software Engineer, Front End (Hybrid-Remote)\nLocation: Upper East Side\nOrg Unit: Olivier Elemento Lab\nWork Days: Monday-Friday\nExemption Status: Exempt\nSalary Range: $99,800.00 - $123,200.00\nAs (omitted for brevity...)"
        }
    ]
}
Congrats! You just learned how to scrape Indeed with Python!
Conclusion
In this tutorial, you understood why Indeed is one of the best job portals on the web and how to extract data from it. In particular, you saw how to build a Python scraper that can retrieve job openings data from it.
As shown here, scraping Indeed is not the easiest task. The site comes with sneaky anti-scraping protections that might block your script. When dealing with such sites, you need a controllable browser that can automatically handle CAPTCHAs, fingerprinting, automated retries, and more for you. This is exactly what our new Scraping Browser solution is all about!
Don’t want to deal with web scraping at all but are interested in jobs data? Explore our Indeed datasets and our job postings dataset. Register now and start your free trial.
Note: This guide was thoroughly tested by our team at the time of writing, but as websites frequently update their code and structure, some steps may no longer work as expected.