How to Scrape Job Postings Data

Follow this step-by-step tutorial and learn how to build a web scraping Indeed Python script to automatically retrieve data about job openings.

Why Scrape Jobs Data From the Web?

Scraping jobs data from the Web is useful for several reasons, including:

  • Market research: Allows businesses and job market analysts to gather information on industry trends, such as which skills are in high demand or which geographic regions are experiencing job growth. It also enables you to monitor competitors’ hiring activity.
  • Streamlining job search and matching: Helps job seekers search job listings from multiple sources to find positions that match their qualifications and preferences.
  • Recruitment and HR optimization: Supports the recruitment process by facilitating hiring and helping to understand market salary trends and the benefits sought by candidates.

Thus, job data is useful for both employers and job seekers.

When it comes to scraping job listings, there is one essential aspect to stress. The target platform needs to be public. In other words, it must allow even non-logged-in users to perform job searches. This is because scraping data behind a login wall can get you in trouble for legal reasons.

That means taking LinkedIn out of the equation. What other job platforms remain? Indeed, one of the leading online job platforms!

Libraries and Tools for Scraping Indeed

Python is considered one of the best languages for scraping thanks to its syntax, ease of use, and rich ecosystem of libraries. So, let’s go for it. Check out our guide on web scraping with Python.

You now need to choose the right scraping libraries out of the many available. To make an informed decision, explore Indeed in your browser. You will notice that most of the data on the site is retrieved after interaction. This means that the site relies heavily on AJAX to load and update content dynamically without full page reloads. To scrape such a site, you need a tool that is able to run JavaScript. That tool is Selenium!

Selenium makes it possible to scrape dynamic websites in Python. It renders sites in a controllable web browser, performing operations as you instruct it. Thanks to Selenium, you can scrape data even if the target site uses JavaScript for rendering or data retrieval.

Scraping Jobs Data From Indeed With Selenium

Follow this step-by-step tutorial and see how to build a web scraping Indeed Python script.

Step 1: Project setup

Before web scraping jobs, make sure you meet these prerequisites:

  • Python 3+ installed on your machine
  • A Python IDE of your choice

You now have everything you need to set up a Python project!

Open the terminal and launch the following commands to:

  1. Create an indeed-scraper folder 
  2. Enter it
  3. Initialize it with a Python virtual environment

mkdir indeed-scraper
cd indeed-scraper
python -m venv env

On Linux or macOS, run the command below to activate the environment:

source env/bin/activate

While on Windows, execute:

env\Scripts\activate.ps1

Next, initialize a scraper.py file containing the line below in the project folder:

print("Hello, World!")

Right now, it only prints “Hello, World!” but it will soon contain the Indeed scraping logic.

Launch it to verify that it works with:

python scraper.py

If all went as planned, it should print this message in the terminal:

Hello, World!

Now that you know that the script works, open the project folder in your Python IDE.

Well done! Get ready to write some Python code!

Step 2: Install the scraping libraries

As mentioned earlier, Selenium is a great tool when it comes to web scraping job postings from Indeed. Run the command below in the activated Python virtual environment to add it to the project’s dependencies:

pip install selenium

This might take a while, so be patient.

Please note that this tutorial refers to Selenium 4.11.2, which comes with automatic driver detection capabilities. If you have an older version of Selenium installed on your PC, update it with:

pip install selenium -U
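
Not sure which version you have? You can print it from Python itself:

python -c "import selenium; print(selenium.__version__)"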

Now, clear scraper.py. Then, import the package and initialize a Selenium scraper with:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# set up a controllable Chrome instance
# in headless mode
service = Service()
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(
    service=service,
    options=options
)

# scraping logic...

# close the browser and free up the resources
driver.quit()

This script creates a WebDriver instance to programmatically control a Chrome window. The browser will be opened behind the scenes in headless mode, which means with no GUI. That is a common setup for production. If you instead prefer to watch the operations performed by the scraping script on the page, comment out that option. This is useful in development.
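
If you would like to switch between headed and headless mode without editing the script every time, one option is to drive that choice with an environment variable. Here is a minimal sketch, assuming a hypothetical HEADLESS variable:

import os
from selenium import webdriver

options = webdriver.ChromeOptions()
# enable headless mode unless HEADLESS=0 is set (hypothetical convention)
if os.getenv("HEADLESS", "1") != "0":
    options.add_argument("--headless=new")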

Make sure that your Python IDE does not report any errors. Ignore any warnings you may receive because of the unused imports. You are about to use those libraries to extract job data from Indeed!

Perfect! Time to build your web scraping Indeed Python scraper.

Step 3: Connect to the target web page

Open Indeed and search for jobs you are interested in. In this guide, you will see how to scrape remote job postings for software engineers in New York. Keep in mind that any other Indeed job search will do. The scraping logic will be the same.

Here is what the target page looks like in the browser as of this writing:

Indeed gif: remote software engineer jobs in New York

Specifically, this is what the URL of the target page looks like:

https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY&sc=0kf%3Aattr%28DSQF7%29%3B&radius=100

As you can see, it is a dynamic URL that changes based on some query parameters.
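
If you want to build such a URL programmatically for other searches, you can rely on urlencode() from the Python Standard Library. Below is a minimal sketch; note that the sc parameter encoding the “remote” filter is omitted for simplicity:

from urllib.parse import urlencode

# q = search keywords, l = location, radius = search radius in miles
params = {
    "q": "software engineer",
    "l": "New York, NY",
    "radius": 100,
}
url = "https://www.indeed.com/jobs?" + urlencode(params)
print(url)
# https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY&radius=100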

You can then use Selenium to connect to the target page with:

driver.get("https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY&sc=0kf%3Aattr%28DSQF7%29%3B&radius=100")

The get() function instructs the browser to visit the page specified by the URL passed as a parameter.

After opening the page, you should set the window size to ensure that every element is visible:

driver.set_window_size(1920, 1080)

This is what your scraping Indeed script looks like so far:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# set up a controllable Chrome instance
# in headless mode
service = Service()
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(
    service=service,
    options=options
)

# set the window size to make sure pages
# will not be rendered in responsive mode
driver.set_window_size(1920, 1080)

# open the target page in the browser
driver.get("https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY&sc=0kf%3Aattr%28DSQF7%29%3B&radius=100")

# scraping logic...

# close the browser and free up the resources
driver.quit()

Comment out the headless mode option and launch the script. It will open the window below for a fraction of a second before closing:

Selenium script for scraping Indeed's software engineer listings in New York

Note the “Chrome is being controlled by automated software” disclaimer. That confirms Selenium is operating the browser as expected.

Step 4: Familiarize yourself with the page structure

Before diving into scraping, there is another crucial step to carry out. Scraping data from a site involves selecting HTML elements and extracting data from them. Finding a way to get the desired nodes from the DOM is not always easy. That is why you should spend some time analyzing the page structure to define an effective selection strategy.

Open your browser and visit the Indeed job search page. Right-click on any element and select the “Inspect” option to open the DevTools of your browser:

Screenshot Using 'Inspect' in the browser on the Indeed job search page

Here, you will see that most elements containing interesting data have CSS classes such as the following:

  • css-j45z4f, css-1m4cuuf, …
  • e37uo190, eu4oa1w0, …
  • job_f27ade40cc1a3686, job_1a53a17f1faeae92, …

Since these classes appear to be randomly generated at build time, you should not rely on them for scraping. Instead, you should base the selection logic on classes such as:

  • jobsearch-JobInfoHeader-title
  • date
  • cardOutline

Or IDs like:

  • companyRatings
  • applyButtonLinkContainer
  • jobDetailsSection

Also, note that some nodes have unique HTML attributes:

  • data-company-name
  • data-testid

That is useful information to keep in mind for web scraping jobs from Indeed. Interact with the page to study how it reacts and what data it shows. You will realize that different job openings have different info attributes.
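
To keep that knowledge in one place, you could collect the stable hooks in a dictionary of CSS selectors. This is just an organizational sketch; all the selectors below come from the inspection above and will show up again in the next steps:

# CSS selectors built on the stable classes, IDs, and attributes
# identified while inspecting the page
SELECTORS = {
    "job_card": ".cardOutline",
    "title": ".jobsearch-JobInfoHeader-title",
    "company": "div[data-company-name='true']",
    "rating": "#companyRatings",
    "location": "[data-testid='inlineHeader-companyLocation']",
}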

Keep inspecting the target site and familiarize yourself with its DOM structure until you feel ready to move on.

Step 5: Start extracting the job data

A single Indeed search page contains several job openings. So, you need an array to keep track of the jobs scraped from the page:

jobs = []

As you must have noticed in the previous step, the job postings are shown in .cardOutline cards:

Job postings displayed in .cardOutline cards on Indeed

Select them all with:

job_cards = driver.find_elements(By.CSS_SELECTOR, ".cardOutline")

The find_elements() method from Selenium allows you to locate web elements on a web page. Similarly, there is also the find_element() method to get the first node that matches the selection query.

By.CSS_SELECTOR instructs the driver to use a CSS selector strategy. Selenium also supports:

  • By.ID: To search for an element by the id HTML attribute
  • By.TAG_NAME: To search for elements based on their HTML tag
  • By.XPATH: To search for elements via an XPath expression

Import By with:

from selenium.webdriver.common.by import By
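
For reference, this is what the alternative locator strategies listed above look like in practice. The snippet below is illustrative only; the script in this tutorial mostly sticks to CSS selectors:

# search by the id HTML attribute
details_section = driver.find_element(By.ID, "jobDetailsSection")

# search for elements by their HTML tag
list_items = driver.find_elements(By.TAG_NAME, "li")

# search via an XPath expression (roughly equivalent to ".cardOutline")
job_cards = driver.find_elements(By.XPATH, "//*[contains(@class, 'cardOutline')]")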

Iterate over the list of job cards, and initialize a Python dictionary in which to store the job details:

for job_card in job_cards:
    # initialize a dictionary to store the scraped job data
    job = {}
    # job data extraction logic...

A job posting can have several attributes. Since only a small portion of them are mandatory, initialize a list of variables with default values right away:

posted_at = None
applications = None
title = None
company_name = None
company_rating = None
company_reviews = None
location = None
location_type = None
apply_link = None
pay = None
job_type = None
benefits = None
description = None
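
If you prefer not to juggle a dozen variables, an equivalent approach is to initialize the dictionary keys directly with dict.fromkeys(). This is just a stylistic alternative; the rest of the tutorial sticks to individual variables:

# map every job attribute to None as a default value
job = dict.fromkeys([
    "posted_at", "applications", "title", "company_name",
    "company_rating", "company_reviews", "location", "location_type",
    "apply_link", "pay", "job_type", "benefits", "description",
])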

Now that you are familiar with the page, you know that some details are in the outline job card. Others are instead in the details tab that shows up upon interaction.

For example, the creation date and the number of applications are in the summary tab:

'Summary' tab showing creation date and number of applications

Extract them both with:

try:
    date_element = job_card.find_element(By.CSS_SELECTOR, ".date")
    date_element_text = date_element.text
    posted_at_text = date_element_text

    if "•" in date_element_text:
        date_element_text_array = date_element_text.split("•")
        posted_at_text = date_element_text_array[0]
        applications = date_element_text_array[1] \
            .replace("applications", "") \
            .replace("in progress", "") \
            .strip()

    posted_at = posted_at_text \
        .replace("Posted", "") \
        .replace("Employer", "") \
        .replace("Active", "") \
        .strip()
except NoSuchElementException:
    pass

This snippet highlights some patterns that are key to web scraping job postings from Indeed. As most info elements are optional, you must protect against the following error:

selenium.common.exceptions.NoSuchElementException: Message: no such element

Selenium throws it when trying to select an HTML element that is not currently on the page.

Import the exception with:

from selenium.common import NoSuchElementException

The try ... except instruction ensures that if the target element is not in the DOM, the script will continue without failing.

Also, some job information is contained in strings like:

<info_1> • <info_2>

If <info_2> is missing, the string format is instead:

<info_1>

Thus, you need to change the data extraction logic based on the presence of the “•” character.

Given an HTML element, you can access its text content with the text attribute. Use the replace() method of Python strings to clean up the collected strings.
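
Since this split-on-“•” logic repeats for several fields, you could factor it into a small helper. split_on_bullet() below is a hypothetical function, not part of the original script:

def split_on_bullet(text):
    # split a "<info_1> • <info_2>" string into its two parts;
    # the second part is None when the "•" separator is missing
    if "•" in text:
        first, second = text.split("•", 1)
        return first.strip(), second.strip()
    return text.strip(), None

# example usage
posted_at_text, applications = split_on_bullet("Posted 3 days ago • 50+ applications")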

Step 6: Deal with Indeed anti-scraping measures

Indeed adopts some techniques and technologies to prevent bots from accessing its data. For example, when you interact with the job cards, it tends to open this modal from time to time:

Modal appearing on Indeed as an anti-scraping measure

This popup blocks interaction. If not properly handled, it will stop your Selenium Indeed script. Inspect it in the DevTools and pay attention to the close button:

Popup in Indeed disrupting Selenium script, highlighting the close button in DevTools

Close this modal in Selenium with:

try:
    dialog_element = driver.find_element(By.CSS_SELECTOR, "[role=dialog]")
    close_button = dialog_element.find_element(By.CSS_SELECTOR, ".icl-CloseButton")
    close_button.click()
except NoSuchElementException:
    pass

The click() method from Selenium enables you to click on the selected element in the controlled browser.

Great! This will close the popup and let you continue the interaction.

Another data protection technology to seriously take into account is Cloudflare. If you interact too much with the page and produce too many requests, Indeed will show you this anti-bot screen:

Cloudflare anti-bot screen on Indeed after excessive interactions

Solving Cloudflare CAPTCHAs from Selenium is a very challenging task that requires a premium product. Scraping Indeed is not that easy, after all. Fortunately, you can avoid them by introducing some random delays in your script.

Make sure the last operation in your for loop is:

time.sleep(random.uniform(1, 5))

This will pause the script for a random number of seconds between 1 and 5.

Import the required packages from the Python Standard Library with:

import random
import time

Way to go! Nothing will stop your automated script from scraping Indeed.

Step 7: Open the job details card

When you click on an outline job card, Indeed performs an AJAX call to retrieve the details on the fly. While waiting for this data, the page shows an animated placeholder:

Animated placeholder on Indeed while loading job details after clicking an outline card

You can verify that the details sections have been loaded when the element below is on the page:

Element indicating the job details section has loaded on Indeed

So, to get access to the job details data in Selenium you have to:

  • Perform the click operation
  • Wait for the page to contain the data of interest

Achieve that with:

job_card.click()

try:
    title_element = WebDriverWait(driver, 5) \
        .until(EC.presence_of_element_located((By.CSS_SELECTOR, ".jobsearch-JobInfoHeader-title")))
    title = title_element.text.replace("\n- job post", "")
except TimeoutException:
    continue

The WebDriverWait object from Selenium allows you to wait for a specific condition to occur. In this case, the script waits up to 5 seconds for .jobsearch-JobInfoHeader-title to be on the page. If the element does not show up in time, Selenium throws a TimeoutException, which the except clause above catches to skip to the next card.

Note that the above snippet also retrieves the title of the job opening.

Import WebDriverWait, EC, and TimeoutException:

from selenium.common import TimeoutException
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

From now on, the element to focus on is this detail column:

Selenium script using WebDriverWait for '.jobsearch-JobInfoHeader-title', highlighting the job title retrieval and the detail column

Select it with:

job_details_element = driver.find_element(By.CSS_SELECTOR, ".jobsearch-RightPane")

Fantastic! You are all set to scrape some job data!

Step 8: Extract the job details

Time to populate the variables defined in step 5 with some job data.

Get the name of the company behind the job opening:

try:
    company_link_element = job_details_element.find_element(By.CSS_SELECTOR, "div[data-company-name='true'] a")
    company_name = company_link_element.text
except NoSuchElementException:
    pass

Then, extract information on the company’s user ratings and number of reviews:

Extracting company's user ratings and number of reviews on Indeed

As you can see, the rating value is not directly exposed as text, so you have to read it from the aria-label attribute:

try:
    company_rating_element = job_details_element.find_element(By.ID, "companyRatings")
    company_rating = company_rating_element.get_attribute("aria-label").split("out")[0].strip()
    company_reviews_element = job_details_element.find_element(By.CSS_SELECTOR, "[data-testid='inlineHeader-companyReviewLink']")
    company_reviews = company_reviews_element.text.replace(" reviews", "")
except NoSuchElementException:
    pass

Next, focus on the company location:

Extracting the company location

Again, you need to apply the “•” pattern mentioned in step 5:

try:
    company_location_element = job_details_element.find_element(By.CSS_SELECTOR,
                                                                "[data-testid='inlineHeader-companyLocation']")
    company_location_element_text = company_location_element.text

    location = company_location_element_text

    if "•" in company_location_element_text:
        company_location_element_text_array = company_location_element_text.split("•")
        location = company_location_element_text_array[0]
        location_type = company_location_element_text_array[1]
except NoSuchElementException:
    pass

Since you may want to quickly apply for the job, take a look at the Indeed “Apply on company site” button as well:

'Apply on company site' button on Indeed

Retrieve the button’s target URL with:

try:
    apply_link_element = job_details_element.find_element(By.CSS_SELECTOR, "#applyButtonLinkContainer button")
    apply_link = apply_link_element.get_attribute("href")
except NoSuchElementException:
    pass

The get_attribute() method from Selenium returns the value of the specified HTML attribute.

Now, the tricky part begins.

If you inspect the “Job details” section, you will notice that there is not an easy way to select the pay and job type elements:

'Job details' section on Indeed showing pay and job type elements

What you can do is:

  1. Get all <div>s inside the “Job details” <div>
  2. Iterate over them
  3. If the current <div>’s text equals “Pay” or “Job Type,” get the next sibling
  4. Extract the data of interest

In other words, you have to implement the logic below:

for div in job_details_element.find_elements(By.CSS_SELECTOR, "#jobDetailsSection div"):
    if div.text == "Pay":
        pay_element = div.find_element(By.XPATH, "following-sibling::*")
        pay = pay_element.text
    elif div.text == "Job Type":
        job_type_element = div.find_element(By.XPATH, "following-sibling::*")
        job_type = job_type_element.text

Selenium does not provide a utility method for accessing the siblings of a node. What you can do instead is use the following-sibling::* XPath expression.
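
For completeness, XPath supports other axes for this kind of relative navigation from the current div element. The examples below are illustrative and not used in this script:

# select the parent of the current node
parent_element = div.find_element(By.XPATH, "..")

# select the sibling that precedes the current node
previous_element = div.find_element(By.XPATH, "preceding-sibling::*")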

Now, focus on the job’s benefits. Usually, there is more than one:

To retrieve them all, you need to initialize a list and populate it with:

try:
    benefits_element = job_details_element.find_element(By.ID, "benefits")
    benefits = []
    for benefit_element in benefits_element.find_elements(By.TAG_NAME, "li"):
        benefit = benefit_element.text
        benefits.append(benefit)
except NoSuchElementException:
    pass

Finally, get the raw job description:

Raw job description section on Indeed

Extract the text of the description with:

try:
    description_element = job_details_element.find_element(By.ID, "jobDescriptionText")
    description = description_element.text
except NoSuchElementException:
    pass

Populate the job dictionary and add it to the jobs list:

job["posted_at"] = posted_at
job["applications"] = applications
job["title"] = title
job["company_name"] = company_name
job["company_rating"] = company_rating
job["company_reviews"] = company_reviews
job["location"] = location
job["location_type"] = location_type
job["apply_link"] = apply_link
job["pay"] = pay
job["job_type"] = job_type
job["benefits"] = benefits
job["description"] = description

jobs.append(job)

You can also add a log instruction to verify that the script works as expected:

print(job)

Run the script:

python scraper.py

This will produce an output similar to:

{'posted_at': '17 days ago', 'applications': '50+', 'title': 'Software Support Engineer', 'company_name': 'Integrated DNA Technologies (IDT)', 'company_rating': '3.5', 'company_reviews': '95', 'location': 'New York, NY 10001', 'location_type': 'Remote', 'apply_link': 'https://www.indeed.com/applystart?jk=c00120130a9c933b&from=vj&pos=bottom&mvj=0&jobsearchTk=1h9fpft0fj3t3800&spon=0&sjdu=YmZE5d5THV8u75cuc0H6Y26AwfY51UOGmh3Z9h4OvXiYhWlsa56nLum9aT96NeA9XAwdulcUk0atwlDdDDqlBQ&vjfrom=tp-semfirstjob&astse=bcf3778ad128bc26&assa=2447', 'pay': '$80,000 - $100,000 a year', 'job_type': 'Full-time', 'benefits': ['401(k)', '401(k) matching', 'Dental insurance', 'Health insurance', 'Paid parental leave', 'Paid time off', 'Parental leave', 'Vision insurance'], 'description': "Integrated DNA Technologies (IDT) is the leading manufacturer of custom oligonucleotides and proprietary technologies for (omitted for brevity...)"}

Et voilà! You just learned how to scrape job postings off websites.

Step 9: Scrape multiple job opening pages

A typical job search on Indeed produces a paginated list with dozens of results. See how to scrape each of them!

First, inspect a page and note how Indeed behaves. In detail, it shows the following element when a next page is available:

Element on Indeed indicating the availability of a next page

Otherwise, the next page element is missing:

Indeed page missing the 'next page' element

Keep in mind that Indeed may return a list with hundreds of job openings. Since you do not want your script to run forever, consider adding a limit to the number of pages scraped.

Implement web crawling on Indeed in Selenium with:

pages_scraped = 0
pages_to_scrape = 5
while pages_scraped < pages_to_scrape:
    job_cards = driver.find_elements(By.CSS_SELECTOR, ".cardOutline")

    for job_card in job_cards:
        # scraping logic...
        pass

    pages_scraped += 1

    # if this is not the last page, go to the next page
    # otherwise, break the while loop
    try:
        next_page_element = driver.find_element(By.CSS_SELECTOR, "a[data-testid=pagination-page-next]")
        next_page_element.click()
    except NoSuchElementException:
        break

The Indeed scraper will now keep looping until it reaches the last page or goes through 5 pages.

Step 10: Export scraped data to JSON

Right now, the scraped data is stored in a list of Python dictionaries. Export it to JSON to make it easier to share and read.

First, create an output object:

output = {
    "date": datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    "jobs": jobs
}

The date attribute is required because the job opening publication dates are in the format “<X> days ago.” Without some context on the day the jobs data was scraped, it would be difficult to understand it.

Remember to import datetime:

from datetime import datetime
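
If you ever need absolute dates, you can combine that scrape date with the relative “<X> days ago” strings. Here is a rough sketch; parse_relative_date() is a hypothetical helper that ignores edge cases such as “Today” or “30+ days ago”:

from datetime import datetime, timedelta

def parse_relative_date(posted_at, scraped_on):
    # extract the number of days from a "<X> days ago" string
    days = int(posted_at.split(" ")[0])
    # subtract it from the scrape date to get an absolute date
    return (scraped_on - timedelta(days=days)).strftime("%Y-%m-%d")

print(parse_relative_date("17 days ago", datetime(2023, 9, 2)))
# prints "2023-08-16"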

Then, export it with:

import json

# scraping logic...

with open("jobs.json", "w") as file:
    json.dump(output, file, indent=4)

The above snippet initializes a jobs.json output file with open() and populates it with JSON data via json.dump(). Check out our article to learn more about how to parse and serialize data to JSON in Python.

The json package comes from the Python Standard Library, so you do not even need to install an extra dependency to achieve the objective.
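
To double-check the export, you can read the file back with json.load(). This quick verification snippet is separate from the scraper:

import json

with open("jobs.json") as file:
    data = json.load(file)

print(f"Scraped {len(data['jobs'])} jobs on {data['date']}")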

Wow! You started from raw job data contained in a webpage and now have semi-structured JSON data. You are ready to take a look at the entire web scraping Indeed Python script.

Step 11: Put it all together

Here is the complete scraper.py file:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.common import NoSuchElementException, TimeoutException
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import random
import time
from datetime import datetime
import json

# set up a controllable Chrome instance
# in headless mode
service = Service()
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(
    service=service,
    options=options
)

# open the target page in the browser
driver.get("https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY&sc=0kf%3Aattr%28DSQF7%29%3B&radius=100")
# set the window size to make sure pages
# will not be rendered in responsive mode
driver.set_window_size(1920, 1080)

# a data structure where to store the job openings
# scraped from the page
jobs = []

pages_scraped = 0
pages_to_scrape = 5
while pages_scraped < pages_to_scrape:
    # select the job posting cards on the page
    job_cards = driver.find_elements(By.CSS_SELECTOR, ".cardOutline")

    for job_card in job_cards:
        # initialize a dictionary to store the scraped job data
        job = {}

        # initialize the job attributes to scrape
        posted_at = None
        applications = None
        title = None
        company_name = None
        company_rating = None
        company_reviews = None
        location = None
        location_type = None
        apply_link = None
        pay = None
        job_type = None
        benefits = None
        description = None

        # get the general job data from the outline card
        try:
            date_element = job_card.find_element(By.CSS_SELECTOR, ".date")
            date_element_text = date_element.text
            posted_at_text = date_element_text

            if "•" in date_element_text:
                date_element_text_array = date_element_text.split("•")
                posted_at_text = date_element_text_array[0]
                applications = date_element_text_array[1] \
                    .replace("applications", "") \
                    .replace("in progress", "") \
                    .strip()

            posted_at = posted_at_text \
                .replace("Posted", "") \
                .replace("Employer", "") \
                .replace("Active", "") \
                .strip()
        except NoSuchElementException:
            pass

        # close the anti-scraping modal
        try:
            dialog_element = driver.find_element(By.CSS_SELECTOR, "[role=dialog]")
            close_button = dialog_element.find_element(By.CSS_SELECTOR, ".icl-CloseButton")
            close_button.click()
        except NoSuchElementException:
            pass

        # load the job details card
        job_card.click()

        # wait for the job details section to load after the click
        try:
            title_element = WebDriverWait(driver, 5) \
                .until(EC.presence_of_element_located((By.CSS_SELECTOR, ".jobsearch-JobInfoHeader-title")))
            title = title_element.text.replace("\n- job post", "")
        except TimeoutException:
            continue

        # extract the job details
        job_details_element = driver.find_element(By.CSS_SELECTOR, ".jobsearch-RightPane")

        try:
            company_link_element = job_details_element.find_element(By.CSS_SELECTOR, "div[data-company-name='true'] a")
            company_name = company_link_element.text
        except NoSuchElementException:
            pass

        try:
            company_rating_element = job_details_element.find_element(By.ID, "companyRatings")
            company_rating = company_rating_element.get_attribute("aria-label").split("out")[0].strip()
            company_reviews_element = job_details_element.find_element(By.CSS_SELECTOR, "[data-testid='inlineHeader-companyReviewLink']")
            company_reviews = company_reviews_element.text.replace(" reviews", "")
        except NoSuchElementException:
            pass

        try:
            company_location_element = job_details_element.find_element(By.CSS_SELECTOR,
                                                                        "[data-testid='inlineHeader-companyLocation']")
            company_location_element_text = company_location_element.text

            location = company_location_element_text

            if "•" in company_location_element_text:
                company_location_element_text_array = company_location_element_text.split("•")
                location = company_location_element_text_array[0]
                location_type = company_location_element_text_array[1]
        except NoSuchElementException:
            pass

        try:
            apply_link_element = job_details_element.find_element(By.CSS_SELECTOR, "#applyButtonLinkContainer button")
            apply_link = apply_link_element.get_attribute("href")
        except NoSuchElementException:
            pass

        for div in job_details_element.find_elements(By.CSS_SELECTOR, "#jobDetailsSection div"):
            if div.text == "Pay":
                pay_element = div.find_element(By.XPATH, "following-sibling::*")
                pay = pay_element.text
            elif div.text == "Job Type":
                job_type_element = div.find_element(By.XPATH, "following-sibling::*")
                job_type = job_type_element.text

        try:
            benefits_element = job_details_element.find_element(By.ID, "benefits")
            benefits = []
            for benefit_element in benefits_element.find_elements(By.TAG_NAME, "li"):
                benefit = benefit_element.text
                benefits.append(benefit)
        except NoSuchElementException:
            pass

        try:
            description_element = job_details_element.find_element(By.ID, "jobDescriptionText")
            description = description_element.text
        except NoSuchElementException:
            pass

        # store the scraped data
        job["posted_at"] = posted_at
        job["applications"] = applications
        job["title"] = title
        job["company_name"] = company_name
        job["company_rating"] = company_rating
        job["company_reviews"] = company_reviews
        job["location"] = location
        job["location_type"] = location_type
        job["apply_link"] = apply_link
        job["pay"] = pay
        job["job_type"] = job_type
        job["benefits"] = benefits
        job["description"] = description
        jobs.append(job)

        # wait for a random number of seconds from 1 to 5
        # to avoid rate limiting blocks
        time.sleep(random.uniform(1, 5))

    # increment the scraping counter
    pages_scraped += 1

    # if this is not the last page, go to the next page
    # otherwise, break the while loop
    try:
        next_page_element = driver.find_element(By.CSS_SELECTOR, "a[data-testid=pagination-page-next]")
        next_page_element.click()
    except NoSuchElementException:
        break

# close the browser and free up the resources
driver.quit()

# produce the output object
output = {
    "date": datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    "jobs": jobs
}

# export it to JSON
with open("jobs.json", "w") as file:
    json.dump(output, file, indent=4)

In less than 200 lines of code, you just built a fully-featured web scraper to scrape jobs data from Indeed.

Launch it with:

python scraper.py

Wait a few minutes for the script to complete.

At the end of the scraping process, a jobs.json file will appear in the root folder of your project. Open it and you will see:

{
    "date": "2023-09-02 19:56:44",
    "jobs": [
        {
            "posted_at": "7 days ago",
            "applications": "50+",
            "title": "Software Engineer - All Levels",
            "company_name": "Listrak",
            "company_rating": "3",
            "company_reviews": "5",
            "location": "King of Prussia, PA",
            "location_type": "Remote",
            "apply_link": "https://www.indeed.com/applystart?jk=f27ade40cc1a3686&from=vj&pos=bottom&mvj=0&jobsearchTk=1h9bge7mbhdj0800&spon=0&sjdu=YmZE5d5THV8u75cuc0H6Y26AwfY51UOGmh3Z9h4OvXgPYWebWpM-4nO05Ssl8I8z-BhdrQogdzP3xc9-PmOQTQ&vjfrom=vjs&astse=16430083478063d1&assa=2381",
            "pay": null,
            "job_type": null,
            "benefits": [
                "Gym membership",
                "Paid time off"
            ],
            "description": "About Listrak:\nWe are a SaaS company that offers an integrated digital marketing platform trusted by 1,000+ leading retailers and brands for email, text message marketing, identity resolution, behavioral triggers and cross-channel orchestration. Our HQ is in (omitted for brevity...)"
        },
        // omitted for brevity...
        {
            "posted_at": "9 days ago",
            "applications": null,
            "title": "Software Engineer, Front End (Hybrid-Remote)",
            "company_name": "Weill Cornell Medicine",
            "company_rating": "3.4",
            "company_reviews": "41",
            "location": "New York, NY 10021",
            "location_type": "Remote",
            "apply_link": "https://www.indeed.com/applystart?jk=1a53a17f1faeae92&from=vj&pos=bottom&mvj=0&jobsearchTk=1h9bge7mbhdj0800&spon=0&sjdu=YmZE5d5THV8u75cuc0H6Y26AwfY51UOGmh3Z9h4OvXgZADiLYj9Y4htcvtDy_iaWMIfcMu539kP3i1FMxIq2rA&vjfrom=vjs&astse=90a9325429efdf13&assa=4615",
            "pay": "$99,800 - $123,200 a year",
            "job_type": null,
            "benefits": null,
            "description": "Title: Software Engineer, Front End (Hybrid-Remote)\nTitle: Software Engineer, Front End (Hybrid-Remote)\nLocation: Upper East Side\nOrg Unit: Olivier Elemento Lab\nWork Days: Monday-Friday\nExemption Status: Exempt\nSalary Range: $99,800.00 - $123,200.00\nAs (omitted for brevity...)"
        }
    ]
}

Congrats! You just learned how to scrape Indeed with Python!

Conclusion

In this tutorial, you understood why Indeed is one of the best job portals on the web and how to extract data from it. In particular, you saw how to build a Python scraper that can retrieve job openings data from it.

As shown here, scraping Indeed is not the easiest task. The site comes with sneaky anti-scraping protections that might block your script. When dealing with such sites, you need a controllable browser that can automatically handle CAPTCHAs, fingerprinting, automated retries, and more for you. This is exactly what our new Scraping Browser solution is all about!

Don’t want to deal with web scraping at all but are interested in jobs data? Explore our Indeed datasets and our job postings dataset.

Not sure what product you need? Talk to one of our data experts.